Padhai Time

Text Cleaning Basics

So far we have learned what NLP is, what its components are, and what challenges arise during text processing. Now it is time to do a bit of coding, so let us start cleaning the text corpus.

When a text corpus is given to us, it may have the following issues:

HTML tags
Upper / Lower Case inconsistency
Punctuation
Stop words
Words not in their root form
And so on...

Before using the data for predictions, we need to clean it. Let us start working on these issues one by one:

1) HTML Tags removal:

Text scraped from a website often includes HTML tags. They are markup rather than content, so we should remove them.

Example:

“The market extended gains for the seventh consecutive session, climbing 1 percent to end at <b> record </b> closing high on May 31. Reliance Industries <h2> continued to be a leader </h2> in the rally, followed by <br> private banks & financials and FMCG stocks.”

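To clean the above text, let us remove every tag, that is, each opening angle bracket ‘<’, the matching closing bracket ‘>’, and everything in between. We can write a regex (regular expression) for it. The pattern <.*?> is non-greedy, so each match stops at the nearest ‘>’ instead of swallowing the text between two tags.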

import re

text_data = '''The market extended gains for the seventh consecutive session, climbing 1 percent to end at <b> record </b> closing high on May 31. Reliance Industries <h2> continued to be a leader </h2> in the rally, followed by <br> private banks & financials and FMCG stocks.'''

# Non-greedy pattern: matches from '<' up to the nearest '>'
html_pattern = re.compile(r'<.*?>')

# Replace every tag with an empty string
text_data = html_pattern.sub('', text_data)

text_data

Output:

“The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.”

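Notice that the HTML tags have been replaced with empty strings. The spaces that surrounded each tag are still there, so the raw string contains a few double spaces that the displayed output above does not show. The word splitting in step 4 removes them.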
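
The same step can be sketched as a small reusable function. The name strip_html_tags and the whitespace clean-up are assumptions added for illustration; they are not part of any library.

```python
import re

# Non-greedy match of anything between '<' and '>' (same pattern as above).
HTML_TAG = re.compile(r'<.*?>')
# One or more whitespace characters, used to tidy the gaps left by removed tags.
EXTRA_SPACE = re.compile(r'\s+')


def strip_html_tags(text):
    """Remove HTML tags, then collapse repeated whitespace (hypothetical helper)."""
    without_tags = HTML_TAG.sub('', text)
    return EXTRA_SPACE.sub(' ', without_tags).strip()


sample = 'closing at <b> record </b> high, followed by <br> private banks'
print(strip_html_tags(sample))
# closing at record high, followed by private banks
```

A regex like this is fine for simple snippets. It does not match tags that span several lines, and it can be confused by a ‘>’ inside an attribute value, so a real HTML parser is likely the safer choice for messy pages.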

2) Upper and lower case inconsistency:

Let us remove this inconsistency by converting everything to lower case, so that words such as "Market" and "market" are treated as the same word.

text_data = text_data.lower()

text_data

Output:

“the market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on may 31. reliance industries continued to be a leader in the rally, followed by private banks & financials and fmcg stocks.”

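3) Remove punctuation:

Punctuation marks usually add little meaning for most text-analysis tasks, so we can remove them.

Examples: % ^ & * , ) } etc.

The pattern [^\w\s] matches any character that is neither a word character (letter, digit, or underscore) nor whitespace, so every such character is deleted.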

text_data = re.sub(r'[^\w\s]', '', text_data)

text_data

Output:

“the market extended gains for the seventh consecutive session climbing 1 percent to end at record closing high on may 31 reliance industries continued to be a leader in the rally followed by private banks financials and fmcg stocks”
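
As an alternative sketch (not the approach used above), the standard library's string.punctuation constant can be combined with str.translate. The helper name remove_punctuation is made up for this example. The two approaches differ slightly: string.punctuation covers only ASCII punctuation, and it includes the underscore, which [^\w\s] keeps.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def remove_punctuation(text):
    """Delete ASCII punctuation characters from text (hypothetical helper)."""
    return text.translate(PUNCT_TABLE)


sample = 'private banks & financials, and fmcg_stocks.'

# translate-based version: the underscore is removed too
print(repr(remove_punctuation(sample)))

# regex version from the text: the underscore survives because it is in \w
print(repr(re.sub(r'[^\w\s]', '', sample)))
```

Both versions leave a double space where the "&" used to be. That is harmless here because the next step splits the text on whitespace.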

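4) Remove words of length 2 or less:

Words that carry meaningful information are usually longer than 2 characters, while very short words ("to", "at", "on", "a") are mostly fillers. Note that this step also drops short numbers such as 1 and 31, so skip it if numbers matter for your task.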

text_data = ' '.join(word for word in text_data.split() if len(word)>2)

text_data

Output:

'the market extended gains for the seventh consecutive session climbing percent end record closing high may reliance industries continued leader the rally followed private banks financials and fmcg stocks'
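
The four steps above can be chained into one function. This is a minimal sketch; the name clean_text and the min_length parameter are assumptions made for illustration.

```python
import re


def clean_text(text, min_length=3):
    """Apply the four cleaning steps in order (hypothetical helper)."""
    text = re.sub(r'<.*?>', '', text)      # 1) drop HTML tags
    text = text.lower()                    # 2) convert to lower case
    text = re.sub(r'[^\w\s]', '', text)    # 3) drop punctuation
    # 4) keep words with at least min_length characters; split() also
    #    collapses the extra spaces left behind by the earlier steps
    return ' '.join(word for word in text.split() if len(word) >= min_length)


raw = ('Reliance Industries <h2> continued to be a leader </h2> in the rally, '
       'followed by <br> private banks & financials.')
print(clean_text(raw))
# reliance industries continued leader the rally followed private banks financials
```

The order matters: tags are removed before punctuation, because stripping the angle brackets first would leave tag names such as "h2" behind as ordinary words.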
