When working on a text document, removing punctuation and stop words is not enough; something else still needs our attention.
The words we use in sentences can take many forms. A word may appear in the present, past, or future tense, and its form changes accordingly.
For example:
- The word ‘Go’ appears as ‘Go’ / ‘Goes’ in the present tense and ‘Went’ in the past tense
- The word ‘See’ appears as ‘See’ / ‘Sees’ in the present tense, whereas it is ‘Saw’ in the past tense
These inconsistencies in the data can affect model training and predictions, so we need to make sure that words are reduced to their root forms.
To handle this, there are two methods:
1) Stemming:
Stemming is the process of reducing inflected words to their root form. In this method, suffixes are removed from the inflected word so that only the root remains.
For example, removing the “ing” suffix from the word “Going” leaves “Go”, which is the root form.
A few more examples:
Developing -> Develop
Developed -> Develop
Development -> Develop
Develops -> Develop
All these inflected words take their root form when their suffixes are removed. Internally, the stemming process applies a set of rules for trimming the suffix part.
We can implement stemming in Python using the well-known “nltk” library.
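To make the idea of rule-based suffix trimming concrete, here is a toy stemmer (a sketch only; the real Porter algorithm applies many ordered rules with extra conditions on the remaining stem):

```python
# A toy illustration of rule-based suffix stripping. This is NOT the Porter
# algorithm; it just strips the first matching suffix from a fixed list.
SUFFIXES = ["ing", "ment", "ed", "s"]

def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("developing"))   # develop
print(naive_stem("developed"))    # develop
print(naive_stem("development"))  # develop
print(naive_stem("develops"))     # develop
```

Note that even this toy version shows the weakness discussed below: `naive_stem("goes")` returns `"goe"`, which is not a valid English word.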
If you don't have nltk installed on your machine, you can simply run:
pip install nltk
This will install nltk on your system, and you should be able to import it.
Python Code:
from nltk.stem import PorterStemmer
porter = PorterStemmer()
print(porter.stem("developing"))
print(porter.stem("develops"))
print(porter.stem("development"))
print(porter.stem("developed"))
Output:
develop
develop
develop
develop
However, some words are not handled properly by the stemming process.
For example, “went”, “flew”, and “saw” cannot be converted to their base forms by stemming.
Code:
print(porter.stem("went"))
print(porter.stem("flew"))
print(porter.stem("saw"))
Output:
went
flew
saw
Surprisingly, there is no change in the output, because the stemming process is not smart enough: it only knows how to trim the suffix part, not how to change the form of a word. To solve this, we need an algorithm that understands the linguistic role of each word in the sentence and converts it to its base form accordingly.
Fortunately, we have Lemmatization for this job.
One good thing about stemming is that it is not limited to English: NLTK's SnowballStemmer, for instance, supports several other languages as well.
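As a quick demonstration of multilingual stemming, NLTK's SnowballStemmer can be instantiated with a language name (the French example word below is our own choice for illustration):

```python
from nltk.stem import SnowballStemmer

# The Snowball family covers more than a dozen languages
print(SnowballStemmer.languages)

english = SnowballStemmer("english")
french = SnowballStemmer("french")

print(english.stem("developing"))  # develop
print(french.stem("manger"))       # French verb "to eat"
```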
Pros:
Computationally fast: it simply trims the suffix without worrying about the context of the word.
Cons:
It is not useful if you care about valid words: a stemmer can produce tokens that have no meaning, e.g.
“Goes” -> “goe”
2) Lemmatization:
Lemmatization converts words to their root forms by understanding the context of the word in the sentence.
In stemming, the root word we get after conversion is called a stem.
In lemmatization, it is called a lemma.
Pros:
The root word we get after conversion has a meaning and belongs to the dictionary.
Cons:
It is computationally expensive.
NLTK provides a class called WordNetLemmatizer for this purpose.
Code:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
As mentioned earlier in this tutorial, you may already have the nltk library installed, but to work with WordNetLemmatizer you need to download the WordNet data explicitly.
Code:
import nltk
nltk.download()
This will launch a download window; scroll down, select “wordnet” from the list, and click Download. (Alternatively, you can run nltk.download('wordnet') to fetch it directly without the window.)
Once it has downloaded successfully, you should be able to use WordNetLemmatizer.
Code:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize("going"))
print(wordnet_lemmatizer.lemmatize("goes")) # Lemmatizer is able to convert it to "go"
print(wordnet_lemmatizer.lemmatize("went"))
print(porter.stem("goes")) # Stemming is unable to normalize the word "goes" properly
Output:
going
go
went
goe
But you might be wondering why the Lemmatizer is unable to normalize the words “going” and “went” into their root forms.
It is because we have not passed the context to it.
Part of speech (“pos”) is the parameter we need to specify. By default it is NOUN.
If the word we want to normalize is a verb, then we need to specify the value “v”.
Code:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize("going", pos="v"))
print(wordnet_lemmatizer.lemmatize("goes", pos="v"))
print(wordnet_lemmatizer.lemmatize("went", pos="v"))
print(wordnet_lemmatizer.lemmatize("go", pos="v"))
print(wordnet_lemmatizer.lemmatize("studies", pos="v"))
print(wordnet_lemmatizer.lemmatize("studying", pos="v"))
print(wordnet_lemmatizer.lemmatize("studied", pos="v"))
print(wordnet_lemmatizer.lemmatize("dogs")) # by default, it is noun
print(wordnet_lemmatizer.lemmatize("dogs", pos="n"))
Output:
go
go
go
go
study
study
study
dog
dog
So far, we have looked at many text cleaning and normalization techniques. In the next chapter, we will discuss how to perform feature engineering on text data.
Stay Tuned!!