
Is SpaCy Python NLP Any Good? Seven Ways You Can Be Certain

source link: https://blog.knoldus.com/is-spacy-python-nlp-any-good-seven-ways-you-can-be-certain/
Reading Time: 4 minutes

SpaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what is it about? What do the words mean in context? Who is doing what to whom? Which texts are similar to each other?

spaCy can help answer all of the questions above.

Linguistic Features in SpaCy

SpaCy acts as a one-stop shop for the common tasks in NLP projects: tokenization, lemmatisation, part-of-speech (POS) tagging, named entity recognition, dependency parsing, sentence segmentation, word-to-vector transformations, and other text cleaning and normalization methods.


Installation of SpaCy

!pip install -U spacy
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm

Once we’ve downloaded and installed a model, we can load it via spacy.load(). spaCy offers several pre-trained models; the default model for English is en_core_web_sm.

Calling spacy.load() returns a Language object (conventionally named nlp) containing all the components and data needed to process text.

import spacy
nlp = spacy.load('en_core_web_sm')

Tokenization in SpaCy

Tokenization is the task of splitting a text into meaningful segments called tokens. The input to the tokenizer is a Unicode text and the output is a Doc object.

A Doc is a sequence of Token objects, and we can iterate over the individual tokens.

doc = nlp('We are learning SpaCy library today')
for token in doc:
    print(token.text)
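Tokenization itself does not require a trained model; a blank pipeline created with spacy.blank('en') is enough to split text into tokens. A minimal sketch:

```python
import spacy

# A blank English pipeline has only the tokenizer -- no statistical components.
nlp = spacy.blank('en')
doc = nlp('We are learning SpaCy library today')

# Each token keeps its text plus position information in the Doc.
tokens = [token.text for token in doc]
print(tokens)  # → ['We', 'are', 'learning', 'SpaCy', 'library', 'today']
```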


Part-of-speech tagging

Part of speech tagging is the process of assigning a POS tag to each token depending on its usage in the sentence.

doc = nlp('We are learning SpaCy library today')
for token in doc:
    print(f'{token.text:{15}} {token.lemma_:{15}} {token.pos_:{10}} {token.is_stop}')


Dependency Parsing

Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationships between head words and their dependents.

The head of a sentence has no dependency of its own and is called the root of the sentence; it is usually the main verb, and all other words are linked to it directly or indirectly.

doc = nlp('We are learning SpaCy library today')
for chunk in doc.noun_chunks:
    print(f'{chunk.text:{30}} {chunk.root.text:{15}} {chunk.root.dep_}')


Lemmatization

Closely related to tokenization, lemmatization is the method of reducing a word to its base or root form. This reduced form is called a lemma.

For example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma.

Lemmatization is useful because it collapses the inflected forms of a word so that they can be analyzed as a single item. It also helps normalize the text.

doc = nlp('We are learning SpaCy library today')
for token in doc:
    print(token.text, token.lemma_)


Sentence Boundary Detection

Sentence segmentation is the process of locating the start and end of sentences in a given text. This allows you to divide a text into linguistically meaningful units. SpaCy uses the dependency parse to determine sentence boundaries, and the sentences of a Doc can be accessed through its sents property.

doc = nlp('First Sentence. Second Sentence. Third Sentence.')
print(list(doc.sents))
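When the dependency parser is not needed, spaCy also ships a lighter alternative: the rule-based sentencizer component splits on sentence-final punctuation and works in a blank pipeline. A minimal sketch:

```python
import spacy

# A blank pipeline plus the rule-based 'sentencizer' component --
# no trained parser is required for sentence boundaries this way.
nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')

doc = nlp('First Sentence. Second Sentence. Third Sentence.')
print([sent.text for sent in doc.sents])
# → ['First Sentence.', 'Second Sentence.', 'Third Sentence.']
```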


Named Entity Recognition

Named Entity Recognition (NER) is the process of locating named entities in unstructured text and classifying them into pre-defined categories such as person names, organizations, locations, monetary values, percentages, and time expressions.

Populating such tags for a set of documents can, for example, improve keyword search. Named entities are available as the ents property of a Doc.

doc = nlp('We are learning SpaCy library today')
for ent in doc.ents:
    print(ent.text, ent.label_)

Similarity

Similarity is determined by comparing word vectors, or “word embeddings”: multi-dimensional meaning representations of a word.

In the example below, the words “dog”, “cat”, and “banana” are all fairly common in English, so they are part of the pipeline’s vocabulary and come with a vector. The word “afskfsd”, on the other hand, is much less common and out-of-vocabulary, so its vector representation consists of 300 dimensions of 0. Note that this requires a model with static word vectors, such as en_core_web_md; the small en_core_web_sm model does not include them.

nlp = spacy.load('en_core_web_md')  # a pipeline that includes word vectors
tokens = nlp("dog cat banana afskfsd")
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

Conclusion

In conclusion, spaCy is a modern, reliable NLP framework that has quickly become a standard for doing NLP with Python. Its main advantages are speed, accuracy, and extensibility.

We have gained insights into linguistic annotations such as tokenization, lemmatisation, part-of-speech (POS) tagging, entity recognition, dependency parsing, sentence segmentation, and similarity.

