Text Summarization Based on Word Frequency and Sentence Ranking¶
Import NLTK Package¶
In [91]:
import nltk
Download the Stopwords Corpus and the Punkt Tokenizer Models¶
In [92]:
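nltk.download('stopwords')  # stopword lists, including English
nltk.download('punkt')  # models used by sent_tokenize and word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead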
Input Text¶
In [93]:
text = """Today, with of Digitization everything, 80 percent the data being created is unstructured. Audio, Video, our social footprints, the data generated from conversations between customer service reps, tons of legal document’s texts processed in financial sectors are examples of unstructured data stored in Big Data.
Organizations are turning to natural language processing (NLP) technology to derive understanding from the myriad of these unstructured data available online and in call-logs.
Natural language processing (NLP) is the ability of computers to understand human speech as it is spoken. NLP is a branch of artificial intelligence that has many important implications on the ways that computers and humans interact. Machine Learning has helped computers parse the ambiguity of human language.
Apache OpenNLP, Natural Language Toolkit(NLTK), Stanford NLP are various open source NLP libraries used in real world application below.
Here are multiple ways NLP is used today:
The most basic and well known application of NLP is Microsoft Word spell checking.
Text analysis, also known as sentiment analytics is a key use of NLP. Businesses are most concerned with comprehending how their customers feel emotionally and use that data for betterment of their service.
Email filters are another important application of NLP. By analyzing the emails that flow through the servers, email providers can calculate the likelihood that an email is spam based its content by using Bayesian or Naive based spam filtering.
Call centers representatives engage with customers to hear list of specific complaints and problems. Mining this data for sentiment can lead to incredibly actionable intelligence that can be applied to product placement, messaging, design, or a range of other use cases.
Google and Bing and other search systems use NLP to extract terms from text to populate their indexes and to parse search queries.
Google Translate applies machine translation technologies in not only translating words, but in understanding the meaning of sentences to provide a true translation.
Many important decisions in financial markets use NLP by taking plain text announcements, and extracting the relevant info in a format that can be factored into algorithmic trading decisions. E.g. news of a merger between companies can have a big impact on trading decisions, and the speed at which the particulars of the merger, players, prices, who acquires who, can be incorporated into a trading algorithm can have profit implications in the millions of dollars.
Since the invention of the typewriter, the keyboard has been the king of human-computer interface. But today with voice recognition via virtual assistants,like Amazon’s Alexa, Google’s Now, Apple’s Siri and Microsoft’s Cortana respond to vocal prompts and do everything from finding a coffee shop to getting directions to our office and also tasks like turning on the lights in home, switching the heat on etc. depending on how digitized and wired-up our life is.
Question Answering - IBM Watson is the most prominent example of question answering via information retrieval that helps guide in various areas like healthcare, weather, insurance etc.
Therefore it is clear that Natural Language Processing takes a very important role in new machine human interfaces. It’s an essential tool for leading-edge analytics & is the near future."""
Import Sentence Tokenizer¶
In [94]:
from nltk.tokenize import sent_tokenize
Tokenize the Text into Sentences and Print Them¶
In [95]:
sentenceList = sent_tokenize(text)
print(sentenceList)
Import Word Tokenizer¶
In [96]:
from nltk.tokenize import word_tokenize
Tokenize the Text into Words and Print Them¶
In [97]:
wordsInSentences = word_tokenize(text)
print(wordsInSentences)
Import Stopwords¶
In [98]:
from nltk.corpus import stopwords
# Print the English stopword list itself rather than the corpus reader object
print(stopwords.words('english'))
Import Punctuation¶
In [99]:
from string import punctuation
print(punctuation)
Build a Set of English Stopwords Plus Punctuation Characters¶
In [100]:
_stopwords = set(stopwords.words('english') + list(punctuation))
print(_stopwords)
Remove Stopwords and Punctuation from the Tokenized Words¶
In [101]:
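# Lowercase each token so it matches the lowercase stopword list and the lowercased lookup used when scoring sentences
word_woPunctuation = [word.lower() for word in wordsInSentences if word.lower() not in _stopwords]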
print(word_woPunctuation)
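The NLTK English stopword list is all lowercase, so whether tokens are lowercased before the membership test changes what survives the filter. The sketch below illustrates this with a tiny hand-written stopword set and a pre-split token list; both are hypothetical stand-ins, not the notebook's `_stopwords` or `wordsInSentences`.

```python
from string import punctuation

# Hypothetical mini stopword set standing in for stopwords.words('english')
demo_stopwords = set(["the", "is", "of", "a"] + list(punctuation))

# Hypothetical tokens, as a word tokenizer might return them
tokens = ["The", "most", "basic", "application", "of", "NLP", "is", "spell", "checking", "."]

# Case-sensitive filter: "The" slips through because only "the" is in the set
case_sensitive = [t for t in tokens if t not in demo_stopwords]

# Lowercasing first removes it and leaves every remaining token in one consistent case
case_insensitive = [t.lower() for t in tokens if t.lower() not in demo_stopwords]

print(case_sensitive)
print(case_insensitive)
```

Keeping tokens in one case also matters later, when lowercased sentence words are looked up in the frequency table.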
Import FreqDist from nltk.probability¶
In [102]:
from nltk.probability import FreqDist
Count How Often Each Word Occurs¶
In [103]:
freq = FreqDist(word_woPunctuation)
print(list(freq.items()))
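`FreqDist` is a subclass of `collections.Counter`, so the lookup behaviour the later steps rely on can be tried out with a plain `Counter`. The word list below is made up for illustration.

```python
from collections import Counter

# Hypothetical filtered words, standing in for word_woPunctuation
words = ["nlp", "data", "nlp", "computers", "data", "nlp"]
counts = Counter(words)

print(counts["nlp"])        # number of times "nlp" was seen
print(counts["missing"])    # unseen words count as 0 instead of raising KeyError
print("missing" in counts)  # the membership test is still False for unseen words
```

Reading a missing key does not insert it, which is why a membership test such as `word in freq` stays reliable after lookups.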
Import nlargest from heapq¶
In [104]:
from heapq import nlargest
Print the 10 Most Frequent Words, in Descending Order of Frequency¶
In [105]:
print(nlargest(10, freq, key=freq.get))
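As a small illustration of how `nlargest` works with a `key` function: iterating over a dictionary yields its keys, and `key=counts.get` ranks those keys by their stored counts. The counts below are invented for the example.

```python
from collections import Counter
from heapq import nlargest

# Hypothetical word counts standing in for the frequency distribution
counts = Counter({"nlp": 5, "data": 3, "language": 2, "email": 1})

# Keys only, ordered by descending count
top_two = nlargest(2, counts, key=counts.get)

# Counter offers the same ranking with the counts attached
top_two_with_counts = counts.most_common(2)

print(top_two)
print(top_two_with_counts)
```

Since `FreqDist` inherits from `Counter`, `most_common` should be available on it as well when the counts themselves are wanted.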
Import Default Dictionary¶
In [106]:
from collections import defaultdict
Initialize ranking as an Empty defaultdict(int), Where Every Missing Key Starts at 0¶
In [107]:
ranking = defaultdict(int)
In [108]:
print(ranking)
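A `defaultdict(int)` starts out empty and creates a key with the value 0 the first time that key is accessed, which is what makes `+=` safe on keys that do not exist yet. A minimal sketch with a hypothetical `scores` dictionary:

```python
from collections import defaultdict

scores = defaultdict(int)
print(len(scores))  # nothing is stored yet

scores[2] += 7  # key 2 is created with 0, then incremented
scores[2] += 3
scores[0] += 1

print(dict(scores))
```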
In [109]:
print(list(enumerate(sentenceList)))
Tokenize each enumerated sentence. For every word that appears in the frequency distribution, add that word's frequency to the sentence's score in the ranking dictionary, keyed by sentence index.¶
In [110]:
# Score each sentence as the sum of the frequencies of its words
for i, sentence in enumerate(sentenceList):
    for word in word_tokenize(sentence.lower()):
        if word in freq:
            ranking[i] += freq[word]
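The scoring step can be traced by hand on a toy input. In this sketch the frequency table and sentences are invented, and `str.split` is a simplified stand-in for `word_tokenize`.

```python
from collections import defaultdict

# Hypothetical word frequencies and sentences, for illustration only
freq_demo = {"nlp": 3, "data": 2, "spam": 1}
sentences_demo = [
    "NLP turns unstructured data into insight",
    "Email filters catch spam",
    "The keyboard was king",
]

ranking_demo = defaultdict(int)
for i, sentence in enumerate(sentences_demo):
    # Lowercase before the lookup so "NLP" matches the key "nlp"
    for word in sentence.lower().split():
        if word in freq_demo:
            ranking_demo[i] += freq_demo[word]

print(dict(ranking_demo))
```

A sentence with no scored words never gets a key at all, so it cannot be selected later. Because scores are plain sums, longer sentences tend to score higher; dividing by sentence length is one possible adjustment, though the notebook does not do this.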
Print the ranking dictionary, i.e. each sentence index and its score¶
In [111]:
print(ranking)
In [112]:
print(sentenceList)
Select the Indices of the 4 Highest-Scoring Sentences¶
In [113]:
sent_idx = nlargest(4, ranking, key=ranking.get)
print(sent_idx)
The Summary Is the 4 Top-Ranked Sentences, Kept in Their Original Order¶
In [114]:
summary = [sentenceList[j] for j in sorted(sent_idx)]
print(summary)
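The whole pipeline can be condensed into one function. The version below is a rough sketch that uses only the standard library: the regular expressions are crude stand-ins for `sent_tokenize` and `word_tokenize`, and the function name `summarize`, its parameters, and the demo text are all hypothetical.

```python
import re
from collections import Counter, defaultdict
from heapq import nlargest


def summarize(text, stop_words, n=2):
    """Return the n highest-scoring sentences of text, in original order."""
    # Crude sentence split: break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Crude word tokenizer: lowercase runs of letters and apostrophes
    def tokenize(s):
        return re.findall(r"[a-z']+", s.lower())

    # Word frequencies, ignoring stopwords
    freq = Counter(w for w in tokenize(text) if w not in stop_words)

    # Sentence score = sum of the frequencies of its words
    ranking = defaultdict(int)
    for i, sentence in enumerate(sentences):
        for w in tokenize(sentence):
            if w in freq:
                ranking[i] += freq[w]

    top = nlargest(n, ranking, key=ranking.get)
    return [sentences[j] for j in sorted(top)]


demo_text = ("NLP helps computers read text. Cats sleep a lot. "
             "NLP also helps computers filter spam text.")
demo_stop = {"a", "also", "the"}

result = summarize(demo_text, demo_stop, n=2)
print(" ".join(result))
```

Joining the selected sentences with a space gives a single summary string rather than a list.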