Keyphrase Extraction in NLP
source link: https://www.geeksforgeeks.org/keyphrase-extraction-in-nlp/
In this article, we will learn how to extract keyphrases and keywords from text using natural language processing techniques. We will first discuss keyphrase and keyword extraction and then look at its implementation in Python, using some popular libraries: rake-nltk, yake, spaCy, and Textacy.
Keyword Extraction
Keyphrase or keyword extraction in NLP is a text analysis technique that extracts important words and phrases from the input text. These key phrases can be used in a variety of tasks, including information retrieval, document summarization, and content categorization. This task is performed in two stages:
- Candidate Generation: This process involves the identification of all possible keywords from the input text.
- Keyphrase Ranking: After the candidate keywords are generated, they are ranked in order of importance for the identification of the best keywords.
Popular keyphrase extraction tools and algorithms include RAKE, YAKE, spaCy, and Textacy.
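The two stages described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how any of the libraries below work internally: the stop-word list, the function names `generate_candidates` and `rank_candidates`, and the choice of raw frequency as the ranking signal are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "it", "of", "the", "to"}


def generate_candidates(text, max_len=2):
    """Stage 1: collect every n-gram (up to max_len words) that contains no stop word."""
    candidates = []
    # Work sentence by sentence so candidates never cross a sentence boundary.
    for sentence in re.split(r"[.!?]", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if not any(w in STOP_WORDS for w in gram):
                    candidates.append(" ".join(gram))
    return candidates


def rank_candidates(candidates, top_k=3):
    """Stage 2: rank candidates by frequency (ties: longer phrase first, then A-Z)."""
    counts = Counter(candidates)
    ranked = sorted(counts, key=lambda c: (-counts[c], -len(c.split()), c))
    return ranked[:top_k]


text = "Keyword extraction finds keywords. Keyword extraction is useful for search."
print(rank_candidates(generate_candidates(text)))
# ['keyword extraction', 'extraction', 'keyword']
```

Real extractors differ mainly in the ranking stage: they replace the simple frequency count with statistical or graph-based scores.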
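RAKE
RAKE stands for Rapid Automatic Keyword Extraction. It is a frequency-based keyphrase extractor: it splits the text into candidate phrases at stop words and punctuation, then scores each phrase from the frequency and co-occurrence of its words. To implement RAKE, we will use the rake-nltk library, which can be installed using the following command.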
pip install rake-nltk
The following is the Python implementation of keyphrase extraction using rake-nltk.
# Importing libraries
from rake_nltk import Rake
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Initializing the Rake instance
rake = Rake()

# Input text
input_text = '''
NLP stands for Natural Language Processing.
It is the branch of Artificial Intelligence that gives the ability to machine understand and process human languages.
Human languages can be in the form of text or audio format.
Natural Language Processing started in 1950 When Alan Mathison Turing published an article in the name Computing Machinery and Intelligence.
It is based on Artificial intelligence.
It talks about automatic interpretation and generation of natural language.
As the technology evolved, different approaches have come to deal with NLP tasks.
'''

# Extracting keywords and phrases
rake.extract_keywords_from_text(input_text)

# Phrases are returned from highest to lowest score
keywords = rake.get_ranked_phrases()

# Displaying the keywords
print(keywords)

# Generate WordCloud
wordcloud = WordCloud().generate(' '.join(keywords))

# Display the WordCloud
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Output:
['alan mathison turing published', 'natural language processing started', 'natural language processing',
'name computing machinery', 'process human languages', 'natural language', 'human languages',
'technology evolved', 'nlp tasks', 'nlp stands', 'machine understand', 'different approaches',
'automatic interpretation', 'audio format', 'artificial intelligence', 'artificial intelligence',
'intelligence', 'text', 'talks', 'gives', 'generation', 'form', 'deal', 'come', 'branch', 'based',
'article', 'ability', '1950']
Keyword Extraction using RAKE
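Note:
If you get a LookupError while running the code above, you can fix it by downloading the required NLTK resources with the following code. If the error message names a different resource (recent NLTK releases may ask for punkt_tab), download that one in the same way.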
import nltk
nltk.download('stopwords')
nltk.download('punkt')
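To see why RAKE places long phrases such as 'alan mathison turing published' at the top of its output, it helps to look at how the scores are computed. The sketch below assumes the degree-to-frequency scoring of the original RAKE algorithm, with a tiny hypothetical stop-word list; the function name `rake_scores` is made up for this example and it is not the rake-nltk implementation.

```python
import re
from collections import defaultdict

# Hypothetical mini stop-word list; rake-nltk relies on NLTK's full English list.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "it", "of", "on", "the", "to"}


def rake_scores(text):
    """Score candidate phrases as the sum over words of degree(w) / frequency(w)."""
    # Candidate phrases are maximal runs of non-stop words between punctuation marks.
    phrases = []
    for fragment in re.split(r"[^a-zA-Z0-9\s]", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    frequency = defaultdict(int)  # number of occurrences of a word in the candidates
    degree = defaultdict(int)     # total length of the phrases the word occurs in
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)

    scores = {
        " ".join(phrase): sum(degree[w] / frequency[w] for w in phrase)
        for phrase in set(phrases)
    }
    # Highest score first; alphabetical order breaks ties.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


sample = "Natural language processing is a branch of artificial intelligence."
for phrase, score in rake_scores(sample):
    print(f"{score:4.1f}  {phrase}")
#  9.0  natural language processing
#  4.0  artificial intelligence
#  1.0  branch
```

Because every word inherits the length of the phrases it appears in, a phrase of n otherwise unique words scores n × n, which explains the bias toward longer phrases.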
YAKE
YAKE stands for Yet Another Keyword Extractor. It is an unsupervised approach to automatic keyword extraction that relies on statistical features of the text itself. YAKE assigns a score to each candidate, and the lower the score, the more relevant the keyword. To implement YAKE, we will use the yake library, which can be installed using the following command.
pip install yake
The following is the Python implementation of keyphrase extraction using the yake library.
# Importing libraries
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import yake

# Initializing the YAKE instance
yake_kw = yake.KeywordExtractor()

# Input text
input_text = '''
NLP stands for Natural Language Processing.
It is the branch of Artificial Intelligence that gives the ability to machine understand and process human languages.
Human languages can be in the form of text or audio format.
Natural Language Processing started in 1950 When Alan Mathison Turing published an article in the name Computing Machinery and Intelligence.
It is based on Artificial intelligence.
It talks about automatic interpretation and generation of natural language.
As the technology evolved, different approaches have come to deal with NLP tasks.
'''

# Extracting (keyword, score) pairs
KeyWords = yake_kw.extract_keywords(input_text)

# Displaying the keywords with their scores
print(KeyWords)

# Keeping only the keyword text, discarding the scores
keywords = [kw for kw, _ in KeyWords]

# Generate WordCloud
wordcloud = WordCloud().generate(' '.join(keywords))

# Display the WordCloud
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Output:
[('Natural Language Processing', 0.02100249013859125), ('Language Processing', 0.04163335302639552),
('Natural Language', 0.048148820863336377), ('Artificial Intelligence', 0.06657427591685054),
('Alan Mathison Turing', 0.06812525384060124), ('Language Processing started', 0.07604425290902747),
('human languages', 0.08215351904804695), ('NLP stands', 0.09173112596477705),
('Language', 0.10178153594306494), ('process human languages', 0.11865807800247614),
('Processing', 0.12586811799925435), ('Intelligence', 0.12825620909446891),
('Natural', 0.13778435888897436), ('Alan Mathison', 0.15153101048626974),
('Mathison Turing', 0.15153101048626974), ('Computing Machinery', 0.15153101048626974),
('Mathison Turing published', 0.15160281730925312), ('languages', 0.1526723039145974),
('Artificial', 0.15269328890550202), ('NLP', 0.18058428305612767)]
Keyword Extraction using YAKE
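The YAKE output above contains many overlapping phrases ('Natural Language Processing', 'Language Processing', 'Natural Language'). As a rough sketch of one possible post-processing step, the snippet below sorts the pairs by ascending score, since lower YAKE scores indicate more relevant keywords, and skips any phrase already contained in a better-ranked one. The helper name `best_keywords` and the substring rule are assumptions for illustration, not part of the yake library.

```python
# A few (keyword, score) pairs copied from the YAKE output shown above.
yake_output = [
    ("Natural Language Processing", 0.02100249013859125),
    ("Language Processing", 0.04163335302639552),
    ("Natural Language", 0.048148820863336377),
    ("Artificial Intelligence", 0.06657427591685054),
    ("Alan Mathison Turing", 0.06812525384060124),
    ("NLP", 0.18058428305612767),
]


def best_keywords(pairs, top_k=3):
    """Return the top_k keywords, skipping any contained in an already chosen phrase."""
    chosen = []
    # YAKE scores are "lower is better", so sort in ascending order.
    for keyword, _score in sorted(pairs, key=lambda pair: pair[1]):
        if any(keyword.lower() in kept.lower() for kept in chosen):
            continue
        chosen.append(keyword)
        if len(chosen) == top_k:
            break
    return chosen


print(best_keywords(yake_output))
# ['Natural Language Processing', 'Artificial Intelligence', 'Alan Mathison Turing']
```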
spaCy
spaCy is a free, open-source library designed for performing a wide range of NLP tasks efficiently. It is commonly used to build production-level pipelines with pre-trained models for tasks such as information extraction and sentiment analysis of reviews. It can also be used to extract key phrases and words from input text: the noun phrases (noun chunks) of a parsed document serve as candidate keyphrases. Unlike RAKE and YAKE, this approach does not rank the candidates. The library and its small English model can be installed using the following commands.
pip install -U spacy
python -m spacy download en_core_web_sm
The following is the Python implementation of keyphrase extraction using spaCy.
# Importing libraries
import spacy
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Initializing the spaCy model instance
nlp = spacy.load('en_core_web_sm')

# Input text
input_text = '''
NLP stands for Natural Language Processing.
It is the branch of Artificial Intelligence that gives the ability to machine understand and process human languages.
Human languages can be in the form of text or audio format.
Natural Language Processing started in 1950 When Alan Mathison Turing published an article in the name Computing Machinery and Intelligence.
It is based on Artificial intelligence.
It talks about automatic interpretation and generation of natural language.
As the technology evolved, different approaches have come to deal with NLP tasks.
'''

# Creating a spaCy document
spacy_doc = nlp(input_text)

# Initializing keywords list variable
keywords = []

# Extracting keyphrases: keep every noun chunk that is not itself a stop word
for chunk in spacy_doc.noun_chunks:
    if chunk.text.lower() not in nlp.Defaults.stop_words:
        keywords.append(chunk.text)

# Displaying the keywords
print(keywords)

# Generate WordCloud
wordcloud = WordCloud().generate(' '.join(keywords))

# Display the WordCloud
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Output:
['\nNLP', 'Natural Language Processing', 'the branch', 'Artificial Intelligence', 'the ability',
'human languages', 'Human languages', 'the form', 'text', 'audio format', 'Natural Language Processing',
'Alan Mathison Turing', 'an article', 'the name', 'Computing Machinery', 'Intelligence',
'Artificial intelligence', 'automatic interpretation', 'generation', 'natural language',
'the technology', 'different approaches', 'NLP tasks']
Keyword Extraction using spaCy
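The noun chunks above still carry a leading newline ('\nNLP'), determiners ('the branch', 'an article'), and case-only duplicates ('human languages' / 'Human languages'). One way to tidy them up is sketched below; the helper name `clean_chunks` and the list of determiners are assumptions for this example and are not part of spaCy.

```python
def clean_chunks(chunks, determiners=("the", "a", "an")):
    """Strip whitespace and a leading determiner, then drop case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for chunk in chunks:
        words = chunk.split()  # split() also discards '\n' and repeated spaces
        if words and words[0].lower() in determiners:
            words = words[1:]
        phrase = " ".join(words)
        key = phrase.lower()
        if phrase and key not in seen:
            seen.add(key)
            cleaned.append(phrase)
    return cleaned


# A subset of the noun chunks printed by the spaCy example above.
raw = ["\nNLP", "Natural Language Processing", "the branch", "Artificial Intelligence",
       "human languages", "Human languages", "an article", "Artificial intelligence"]
print(clean_chunks(raw))
# ['NLP', 'Natural Language Processing', 'branch', 'Artificial Intelligence', 'human languages', 'article']
```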
Textacy
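Textacy is a Python library that provides a simple interface for performing various natural language processing (NLP) tasks. It is built on top of spaCy and adds functionality that simplifies common NLP workflows, including implementations of keyterm extraction algorithms such as TextRank and SGRank. This library can be installed using the following command.
pip install textacy
The following is the Python implementation of keyphrase extraction using Textacy.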
# Importing libraries
import textacy
from textacy import extract  # makes textacy.extract.keyterms available
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Load a spaCy model, which will be used for all further processing
en = textacy.load_spacy_lang("en_core_web_sm")

# Input text
input_text = '''
NLP stands for Natural Language Processing.
It is the branch of Artificial Intelligence that gives the ability to machine understand and process human languages.
Human languages can be in the form of text or audio format.
Natural Language Processing started in 1950 When Alan Mathison Turing published an article in the name Computing Machinery and Intelligence.
It is based on Artificial intelligence.
It talks about automatic interpretation and generation of natural language.
As the technology evolved, different approaches have come to deal with NLP tasks.
'''

# Convert the text into a spaCy document
doc = textacy.make_spacy_doc(input_text, lang=en)

# Print the top 5 keyphrases using the TextRank algorithm, as implemented in Textacy
print("Textrank output: \n", textacy.extract.keyterms.textrank(doc, normalize="lemma", topn=5))

# Collect the TextRank keyphrases, discarding their weights
keywords = [kps for kps, weights in textacy.extract.keyterms.textrank(doc, normalize="lemma")]

# Generate WordCloud
wordcloud = WordCloud().generate(' '.join(keywords))

# Display the WordCloud
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Textrank')
plt.show()

# Print the top 5 keyphrases using the SGRank algorithm, as implemented in Textacy
print("SGRank output: \n", [kps for kps, weights in textacy.extract.keyterms.sgrank(doc, topn=5)])

# Collect the SGRank keyphrases, discarding their weights
keywords = [kps for kps, weights in textacy.extract.keyterms.sgrank(doc, normalize="lemma")]

# Generate WordCloud
wordcloud = WordCloud().generate(' '.join(keywords))

# Display the WordCloud
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('SGRank')
plt.show()
Output:
Textrank output:
[('Natural Language Processing', 0.044047486408196675), ('Alan Mathison Turing', 0.04176581650758854),
('Artificial Intelligence', 0.04001459501418585), ('human language', 0.03494095073620351),
('NLP task', 0.03217996705388366)]
Keyword Extraction using Textacy Textrank
SGRank output:
['Natural Language Processing', 'Alan Mathison Turing', 'human language', 'NLP', 'Artificial Intelligence']
Keyword Extraction using Textacy SGRank
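TextRank, used in the example above, is a graph-based ranking method: words become nodes, co-occurrence within a small window becomes an edge, and a PageRank-style iteration assigns each node a score. The following is a heavily simplified word-level sketch of that idea, not Textacy's implementation; the function name `textrank_words`, the window size, the damping factor, and the fixed iteration count are all assumptions chosen for illustration.

```python
from collections import defaultdict


def textrank_words(words, window=2, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score over an undirected co-occurrence graph."""
    # Connect each word to the distinct words found in the next `window` positions.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window + 1]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Every node starts with the same score, which is then refined iteratively.
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbours[n]) for n in neighbours[word]
            )
            for word in neighbours
        }
    return sorted(scores, key=scores.get, reverse=True)


tokens = ["language", "model", "language", "data", "language", "text"]
ranking = textrank_words(tokens, window=1)
print(ranking[0])
# language
```

In this toy input, "language" co-occurs with every other word, so it collects the highest score. Full implementations typically filter tokens by part of speech first and merge adjacent top-ranked words into multi-word keyphrases.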