When working on a text document, removing punctuation and stop words is not enough; something else still needs our attention.
The words we use in sentences can take many forms. A word may appear in the present, past, or future tense, and its form changes accordingly.
For example:
- The word ‘Go’ appears as ‘Go’ / ‘Goes’ in the present tense and ‘Went’ in the past tense
- The word ‘See’ appears as ‘See’ / ‘Sees’ in the present tense, whereas it is ‘Saw’ in the past tense
These inconsistencies in the data can affect model training and predictions, so we need to make sure that words are reduced to their root forms.
To handle this, there are two methods:
1) Stemming:
Stemming is the process of reducing inflected words to their root form. In this method, suffixes are removed from the inflected word so that only the root remains.
For example, removing the “ing” suffix from the word “Going” leaves “Go”, which is the root form.
A few more examples:
Developing -> Develop
Developed -> Develop
Development -> Develop
Develops -> Develop
All these inflected words take their root form when their suffixes are removed. Internally, the stemming process applies a set of rules for trimming the suffix part.
We can implement stemming in Python using the well-known “nltk” library.
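To make the idea of rule-based suffix trimming concrete, here is a toy stemmer (a sketch only; the real Porter algorithm applies many ordered rules with extra conditions on the remaining stem):

```python
# A toy illustration of rule-based suffix stripping. This is NOT the Porter
# algorithm; it just strips the first matching suffix from a fixed list.
SUFFIXES = ["ing", "ment", "ed", "s"]

def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("developing"))   # develop
print(naive_stem("developed"))    # develop
print(naive_stem("development"))  # develop
print(naive_stem("develops"))     # develop
```

Note that even this toy version shows the weakness discussed below: `naive_stem("goes")` returns `"goe"`, which is not a valid English word.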
If you don't have nltk installed on your machine, you can simply run:
pip install nltk
This will install nltk on your system, and you should be able to import it.
Python Code:
from nltk.stem import PorterStemmer
porter = PorterStemmer()
print(porter.stem("developing"))
print(porter.stem("develops"))
print(porter.stem("development"))
print(porter.stem("developed"))
Output:
develop
develop
develop
develop
However, some words are not handled properly by the stemming process.
For example, “went”, “flew”, and “saw” cannot be converted to their base forms by stemming.
Code:
print(porter.stem("went"))
print(porter.stem("flew"))
print(porter.stem("saw"))
Output:
went
flew
saw
Surprisingly, there is no change in the output, because the stemming process is not smart enough: it only knows how to trim the suffix part, not how to change the form of a word. To solve this, we need an algorithm that understands the linguistic role of each word in the sentence and converts it to its base form accordingly.
Fortunately, we have Lemmatization for this job.
One good thing about stemming is that it is not limited to English: NLTK's SnowballStemmer, for instance, supports several other languages as well.
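As a quick demonstration of multilingual stemming, NLTK's SnowballStemmer can be instantiated with a language name (the French example word below is our own choice for illustration):

```python
from nltk.stem import SnowballStemmer

# The Snowball family covers more than a dozen languages
print(SnowballStemmer.languages)

english = SnowballStemmer("english")
french = SnowballStemmer("french")

print(english.stem("developing"))  # develop
print(french.stem("manger"))       # French verb "to eat"
```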
Pros:
Computationally fast: it simply trims the suffix without worrying about the context of the word.
Cons:
It is not useful if you care about valid words: a stemmer can produce tokens that have no meaning, e.g.
“Goes” -> “goe”
2) Lemmatization:
Lemmatization converts words to their root forms by understanding the context of the word in the sentence.
In stemming, the root word we get after conversion is called a stem.
In lemmatization, it is called a lemma.
Pros:
The root word we get after conversion has a meaning and belongs to the dictionary.
Cons:
It is computationally expensive.
NLTK provides a class called WordNetLemmatizer for this purpose.
Code:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
As mentioned earlier in this tutorial, you may already have the nltk library installed, but to work with WordNetLemmatizer you need to download the WordNet data explicitly.
Code:
import nltk
nltk.download()
This will launch a download window; scroll down, select “wordnet” from the list, and click Download. (Alternatively, you can run nltk.download('wordnet') to fetch it directly without the window.)
Once it has downloaded successfully, you should be able to use WordNetLemmatizer.
Code:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize("going"))
print(wordnet_lemmatizer.lemmatize("goes")) # Lemmatizer is able to convert it to "go"
print(wordnet_lemmatizer.lemmatize("went"))
print(porter.stem("goes")) # Stemming is unable to normalize the word "goes" properly
Output:
going
go
went
goe
But you might be wondering why the Lemmatizer is unable to normalize the words “going” and “went” into their root forms.
It is because we have not passed the context to it.
Part of speech (“pos”) is the parameter we need to specify. By default it is NOUN.
If the word we want to normalize is a verb, then we need to specify the value “v”.
Code:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize("going", pos="v"))
print(wordnet_lemmatizer.lemmatize("goes", pos="v"))
print(wordnet_lemmatizer.lemmatize("went", pos="v"))
print(wordnet_lemmatizer.lemmatize("go", pos="v"))
print(wordnet_lemmatizer.lemmatize("studies", pos="v"))
print(wordnet_lemmatizer.lemmatize("studying", pos="v"))
print(wordnet_lemmatizer.lemmatize("studied", pos="v"))
print(wordnet_lemmatizer.lemmatize("dogs")) # by default, it is noun
print(wordnet_lemmatizer.lemmatize("dogs", pos="n"))
Output:
go
go
go
go
study
study
study
dog
dog
So far, we have looked at many text cleaning and normalization techniques. In the next chapter, we will discuss how to perform feature engineering on text data.
Stay Tuned!!