
Is SpaCy Python NLP Any Good? Seven Ways You Can Be Certain

source link: https://blog.knoldus.com/is-spacy-python-nlp-any-good-seven-ways-you-can-be-certain/
Reading Time: 4 minutes

SpaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what is it about? What do the words mean in context? Who is doing what to whom? Which texts are similar to each other?

spaCy can help answer all of the questions above.

Linguistic Features in SpaCy

SpaCy acts as a one-stop shop for the common tasks in NLP projects: tokenization, lemmatisation, part-of-speech (POS) tagging, named entity recognition, dependency parsing, sentence segmentation, word-to-vector transformations, and other text cleaning and normalization methods.


Installation of SpaCy

!pip install -U spacy
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm

Once we’ve downloaded and installed a model, we can load it via spacy.load(). spaCy offers several pre-trained models; the default model for English is en_core_web_sm.

Calling spacy.load() returns a Language object (conventionally named nlp) containing all the components and data needed to process text.

import spacy
nlp = spacy.load('en_core_web_sm')

Tokenization in SpaCy

Tokenization is the task of splitting a text into meaningful segments called tokens. The input to the tokenizer is a Unicode text and the output is a Doc object.

A Doc is a sequence of Token objects, and we can iterate over the individual tokens.

doc = nlp('We are learning SpaCy library today')
for token in doc:
    print(token.text)
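Tokenization itself does not require a trained model; a blank pipeline created with spacy.blank('en') is enough to split text into tokens. A minimal sketch:

```python
import spacy

# A blank English pipeline has only the tokenizer -- no statistical components.
nlp = spacy.blank('en')
doc = nlp('We are learning SpaCy library today')

# Each token keeps its text plus position information in the Doc.
tokens = [token.text for token in doc]
print(tokens)  # → ['We', 'are', 'learning', 'SpaCy', 'library', 'today']
```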


Part-of-speech tagging

Part of speech tagging is the process of assigning a POS tag to each token depending on its usage in the sentence.

doc = nlp('We are learning SpaCy library today')
for token in doc:
    print(f'{token.text:{15}} {token.lemma_:{15}} {token.pos_:{10}} {token.is_stop}')


Dependency Parsing

Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationships between head words and their dependents.

The head of a sentence has no dependency of its own and is called the root of the sentence; it is usually the main verb, and all other words are linked to it directly or indirectly.

doc = nlp('We are learning SpaCy library today')
for chunk in doc.noun_chunks:
    print(f'{chunk.text:{30}} {chunk.root.text:{15}} {chunk.root.dep_}')


Lemmatization

Closely related to tokenization, lemmatization is the method of reducing a word to its base or root form. This reduced form is called a lemma.

For example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma.

Lemmatization is useful because it collapses the inflected forms of a word so that they can be analyzed as a single item. It also helps normalize the text.

doc = nlp('We are learning SpaCy library today')
for token in doc:
    print(token.text, token.lemma_)


Sentence Boundary Detection

Sentence segmentation is the process of locating the start and end of sentences in a given text. This allows you to divide a text into linguistically meaningful units. SpaCy uses the dependency parse to determine sentence boundaries, and the sentences of a Doc can be accessed through its sents property.

doc = nlp('First Sentence. Second Sentence. Third Sentence.')
print(list(doc.sents))
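When the dependency parser is not needed, spaCy also ships a lighter alternative: the rule-based sentencizer component splits on sentence-final punctuation and works in a blank pipeline. A minimal sketch:

```python
import spacy

# A blank pipeline plus the rule-based 'sentencizer' component --
# no trained parser is required for sentence boundaries this way.
nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')

doc = nlp('First Sentence. Second Sentence. Third Sentence.')
print([sent.text for sent in doc.sents])
# → ['First Sentence.', 'Second Sentence.', 'Third Sentence.']
```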


Named Entity Recognition

Named Entity Recognition (NER) is the process of locating named entities in unstructured text and classifying them into pre-defined categories such as person names, organizations, locations, monetary values, percentages, and time expressions.

Populating such tags for a set of documents can, for example, improve keyword search. Named entities are available as the ents property of a Doc.

doc = nlp('We are learning SpaCy library today')
for ent in doc.ents:
    print(ent.text, ent.label_)

Similarity

Similarity is determined by comparing word vectors, or “word embeddings”: multi-dimensional meaning representations of a word.

In the example below, the words “dog”, “cat”, and “banana” are all fairly common in English, so they are part of the pipeline’s vocabulary and come with a vector. The word “afskfsd”, on the other hand, is much less common and out-of-vocabulary, so its vector representation consists of 300 dimensions of 0. Note that this requires a model with static word vectors, such as en_core_web_md; the small en_core_web_sm model does not include them.

nlp = spacy.load('en_core_web_md')  # a pipeline that includes word vectors
tokens = nlp("dog cat banana afskfsd")
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

Conclusion

In conclusion, spaCy is a modern, reliable NLP framework that has quickly become a standard for doing NLP with Python. Its main advantages are speed, accuracy, and extensibility.

We have gained insights into linguistic annotations such as tokenization, lemmatisation, part-of-speech (POS) tagging, entity recognition, dependency parsing, sentence segmentation, and similarity.

