Natural Language Processing Using Python

1. What is natural language processing?

Natural language processing refers to processing and analyzing textual, qualitative data using computers. It relies on algorithms to derive meaning from human language in such a way that we can process it like we would with quantitative data.

Examples include:

  • Recognizing the role of a work within a sentence (e.g. subject, verb, etc.). This is called “Part-Of-Speech” tagging, or in short POS-tagging.
  • Extracting important segments of sentences (noun-phrase extraction).
  • Classification of texts according to their contents.
  • Assessing if texts are positive or negative in tone or whether they contain subjectivity. This is called sentiment analyses.

2. Available packages

There are many packages available for you to start with NLP. However, it is important to note that not all packages support languages other than English out of the box. It might be that you need to train your own model with training data in another language. Alternatively, you could consider if translating the text at hand is a suitable solution.

Some of the better-known packages include:

  • TextBlob. Which we will use in this tutorial. This is actually based on NLTK and [pattern]https://www.clips.uantwerpen.be/pages/pattern-en).
  • Frog. Which supports the Dutch language, but requires some more steps for the installation. If you are interested, this package is included in La Machine, which provides you with a ready to go environment to get started with NLP. If you are planning to do a larger project where you apply NLP, this is definitely something to consider using!
  • SpaCy. Which has a Dutch extension, but does not include sentiment analyses.
  • NLTK

NLTK also provides a repository of corpora, which can be used to train NLP algorithms.

3. Installing TextBlob

In the sections below, we will provide you with a brief introduction to some methods for performing NLP. We will use TextBlob, a Python package for NLP, as an illustration.

Please install TextBlob from your terminal / command prompt using pip as shown below.

$ pip install -U textblob
$ python -m textblob.download_corpora

For additional instructions (e.g. when using Conda) please check here.

The examples in this tutorial are executed in Spyder.

4. Getting started with TextBlob

TextBlob works with its own objects called “TextBlobs”. These blobs are string-like objects that we can store to a variable. They contain a piece of text (string) which can be of any length.

# Import the TextBlob() fucntion.
from textblob import TextBlob

# Store a piece of text to a string variable.
text = "This is a text. It is stored as a string."

# Create a blob of the string stored to "text" using the TextBlob() function.
blob = TextBlob(text)

# We can also directly create a blob without storing a string to a variable first.
blob2 = TextBlob("This is also a string. In this case the blob is created directly.")

We can pass several parameters to the TextBlob() function. However, we will not go into detail about this. Please check here if you want to learn more.

In some cases you might want to split up the text of your blob into separate sentences or words before you start your analyses. This is also called tokenization.

You can do so as follows.

from textblob import TextBlob

text = "This is a text. It is stored as a string."
blob = TextBlob(text)

# Split the text in the blob into separate words and print the wordlist.
wordlist = blob.words
print (wordlist)

# Split the text in the blob into separate sentences and print them.
sentences = blob.sentences
print (sentences)

Output:

  • [‘This’, ‘is’, ‘a’, ‘text’, ‘It’, ‘is’, ‘stored’, ‘as’, ‘a’, ‘string’]
  • [Sentence(“This is a text.”), Sentence(“It is stored as a string.”)]

5. Part of Speech (POS) tagging

Part of speech (POS) tagging refers to the process of adding tags to the words in the blob. These tags indicate the role of the word within the sentence. This process is analogical to children in primary school that have to identify the nouns, subject, verbs, adjectives, adverbs, etc. in sentences. It helps us better understand what the sentence is about. We could, for example, use this information to categorize a list of sentences based on their subject.

The tagging is based on a corpus. A corpus refers to a training/reference set that contains a large number of words or phrases that have been tagged by someone. Based on this information, the POS tagger can classify words in new pieces of text.

You can simply access this information through the tags property of a blob.

from textblob import TextBlob

text = "This is a text. It is stored as a string."
blob = TextBlob(text)

# Print the POS tags of the words in the text.
POStags = blob.tags
print(POStags)

Output:

  • [(‘This’, ‘DT’), (‘is’, ‘VBZ’), (‘a’, ‘DT’), (‘text’, ‘NN’), (‘It’, ‘PRP’), (‘is’, ‘VBZ’), (‘stored’, ‘VBN’), (‘as’, ‘IN’), (‘a’, ‘DT’), (‘string’, ‘NN’)]

6. Noun Phrase Extraction

Noun phrases are POS patterns that include one or multiple nouns. They can also include other parts of a sentence that relate to the noun. For example, “interesting book” can be a noun phrase.

Some common POS patterns for noun phrases include:

  • Noun -Noun-Noun … -Noun
  • Adjective(s)-Noun
  • Verb-(Adjectives-)Noun

Extracting the noun phrases from a text can help you capture the essence.

You can access a list of noun phrases through the noun_phrases property of a blob.

from textblob import TextBlob

text = "This is a beautiful text."
blob = TextBlob(text)

# Print the noun phrases in the text.
nounPhrases = blob.noun_phrases
print(nounPhrases)

Output:

  • [‘beautiful text’]

7. Sentiment Analyses

Sentiment analysis refers to accessing the tone (positive/negative) and degree of subjectivity of a text. We can use this to evaluate if a text contains opinions about something and whether those opinions are positive or negative. Combined with POS tagging and noun phrase extraction, this can help us understand the meaning of opinions.

TextBlob evaluates the sentiment of texts according to two scores.

  • Polarity: The degree to which the piece of text (blob) is positive. Ranging from -1 (very negative) to +1 (very positive).
  • Subjectivity: The degree to which the piece of text (blob) is opinionated. Ranging from 0 (very objective) to 1 (very subjective).

Note that since the output of the sentiment analyses is of type “Sentiment”, we need to convert it to a string using str() before we can print it.

from textblob import TextBlob

blobNeg = TextBlob("I absolutely hate this.")
blobPos = TextBlob("I totally love this!")

# Access the polarity and subjectivity of these two blobs and print scores.
sentiment1 = blobNeg.sentiment
sentiment2 = blobPos.sentiment

print("The scores of the first blob: " + str(sentiment1))
print("The scores of the second blob: " + str(sentiment2))

Output:

  • The scores of the first blob: Sentiment(polarity=-0.8, subjectivity=0.9)
  • The scores of the second blob: Sentiment(polarity=0.625, subjectivity=0.6)

8. Other useful tools from TextBlob

Inflections and Lemmatizations

A lemma refers to the form of a word as it is included in a dictionary. For example, cycling, cycle, cycled, etc. all relate to the lemma (to) cycle. Likewise we also know inflections, such as pluralizations of words (e.g. dog, dogs).

When we analyze text, it can be useful to identify the lemmas and inflections of words to filter, categorize or develop frequency plots. For example, we can group all sentences that refer to cycling or dogs in one form or another.

When we split up a blob into words (blob.words), each word is an object in itself on which we can use several useful methods as shown below. Alternatively, we can define a new Word object using the Word() function.

# Note that we need to import "Word" now.
from textblob import Word

# Create word objects.
word1 = Word("dog")
word2 = Word("went")

# Pluralize word1 and print result.
plural = word1.pluralize()
print (word1 + " , " + plural)

# Lemmatize word2 and print results.
# Need to pass the POS tag as argument. In this case "went" is a verb, so the tag is "v".
lemma = word2.lemmatize("v")
print (word2 + " , " + lemma)

Output:

  • dogs, dog
  • went, go