Padhai Time

Stop Word Removal

Some words in a sentence carry little relevant information for most text-analysis tasks, so they can be removed from the text.

Examples: and, of, at, it, the, etc.

Several NLP libraries operate on text and can remove stop words. Some well-known libraries that support stop word removal:

NLTK
spaCy
Gensim

We are going to use NLTK for this tutorial.

If you do not have this library on your system, you can install it with the command below:

pip install nltk

Code:

import nltk

# The stop word corpus must be downloaded once before first use
nltk.download('stopwords')

stopwords = nltk.corpus.stopwords.words('english')

Now you can check the stop word list using the statement:

print(stopwords)

This will print a list of all the stop words.

len(stopwords) => 179

(The exact count depends on the version of the NLTK data you have installed.)

Now, let us import the word tokenizer, which will split our text corpus into words. Later, each word will be checked against the stop word list; if it is on the list, we drop it. Here text_data is the text string we want to process.

from nltk.tokenize import word_tokenize

tokenized_text = word_tokenize(text_data)

print(tokenized_text)

Output:
['the', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session', 'climbing', 'percent', 'end', 'record', 'closing', 'high', 'may', 'reliance', 'industries', 'continued', 'leader', 'the', 'rally', 'followed', 'private', 'banks', 'financials', 'and', 'fmcg', 'stocks']
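To get a feel for what tokenization does without installing any NLTK data, here is a minimal sketch. It is only a rough stand-in: simple_tokenize is a hypothetical helper that keeps runs of letters, which is much cruder than word_tokenize, and the sample sentence is made up from the first few words of the output above.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical sample sentence, not the tutorial's actual text_data
sample_text = "The market extended gains for the seventh consecutive session."
tokens = simple_tokenize(sample_text)
print(tokens)
```

For real projects, word_tokenize is the better choice because it also handles punctuation and contractions.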

At this point, let us check each word and remove the stop words.

removed_stop_words_list = [word for word in tokenized_text if word not in stopwords]

print(removed_stop_words_list)

Output:
['market', 'extended', 'gains', 'seventh', 'consecutive', 'session', 'climbing', 'percent', 'end', 'record', 'closing', 'high', 'may', 'reliance', 'industries', 'continued', 'leader', 'rally', 'followed', 'private', 'banks', 'financials', 'and', 'fmcg', 'stocks'][:22] + ['fmcg', 'stocks']

As you can see, 'the', 'for', and 'and' have been removed from the text.
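Two details are worth keeping in mind for this step: the stop word list is lowercase, and looking a word up in a set is faster than in a list when the text is large. The sketch below illustrates both. STOP_WORDS is a small hand-written stand-in for the NLTK list, and remove_stop_words is a hypothetical helper, not part of NLTK.

```python
# Small hand-written stand-in for NLTK's English stop word list (assumption)
STOP_WORDS = {"the", "for", "and", "of", "at", "it"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["The", "market", "extended", "gains", "for", "the",
          "seventh", "consecutive", "session"]
filtered = remove_stop_words(tokens, STOP_WORDS)
print(filtered)
```

With the real list, the same idea would be set(stopwords) built once before filtering.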

Important Point: Suppose there is a word that carries no useful information in your domain and you want to remove it too. You can extend the stop word list by adding this word to it and then apply the same removal step.

Example: ‘fmcg’ is a very common word in your domain, so you want to remove it.

stopwords.append('fmcg')

len(stopwords) => 180

removed_stop_words_list = [word for word in tokenized_text if word not in stopwords]

print(removed_stop_words_list)

Output:

['market', 'extended', 'gains', 'seventh', 'consecutive', 'session', 'climbing', 'percent', 'end', 'record', 'closing', 'high', 'may', 'reliance', 'industries', 'continued', 'leader', 'rally', 'followed', 'private', 'banks', 'financials', 'stocks']
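If you have several domain words to remove, one option is to build a separate combined collection instead of appending to the original list one word at a time. This is only a sketch under assumptions: BASE_STOP_WORDS is a tiny stand-in for the NLTK list, and DOMAIN_STOP_WORDS and build_stop_words are hypothetical names chosen for illustration.

```python
# Tiny stand-in for NLTK's English stop word list (assumption)
BASE_STOP_WORDS = ["the", "for", "and"]

# Hypothetical domain-specific words we also want to drop
DOMAIN_STOP_WORDS = ["FMCG", "stocks"]

def build_stop_words(base, extra):
    """Return a new set with both lists combined; the inputs stay unchanged."""
    return set(base) | {word.lower() for word in extra}

custom_stop_words = build_stop_words(BASE_STOP_WORDS, DOMAIN_STOP_WORDS)

tokens = ["private", "banks", "financials", "and", "fmcg", "stocks"]
filtered = [token for token in tokens if token not in custom_stop_words]
print(filtered)
```

Keeping the base list untouched means you can reuse it later for a different domain with a different set of extra words.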

At this point, we have cleaned data. However, some words are still not in their root form, and this can affect the model’s accuracy. Hence it is recommended to convert the words to their base forms. We are going to learn these techniques going forward. Stay tuned!
