
Bag of Words

Bag of Words (BoW) is a technique for extracting features from text data. It represents each sentence by counting the occurrences of each word: every time a word appears in the sentence, its count increases by 1; words that never appear keep a count of 0.

 

This technique is simple and easy to implement, but it comes with its own limitations, which we will discuss at the end of this article.
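Before diving into the full pipeline, here is a minimal sketch of the counting idea in plain Python (the vocabulary and sentence below are made up for illustration):

from collections import Counter

vocabulary = ['quick', 'brown', 'fox', 'lazy', 'dog']
sentence = 'the quick brown fox jumps over the lazy dog'

# Count every word in the sentence, then read off the counts
# for the words in our vocabulary
counts = Counter(sentence.split())
features = [counts[word] for word in vocabulary]
print(features)   # [1, 1, 1, 1, 1]

Libraries such as scikit-learn automate exactly this counting, as we will see below.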

 

How to Use the Bag of Words Technique?

When we have text data available, it is a prerequisite to clean the text first.

We will apply all the cleaning steps discussed in the previous chapter. Once the text is clean, we will use the Bag of Words technique to extract features from it.

  

Code:

import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Note: nltk.download('stopwords') and nltk.download('punkt') may be
# required on the first run to fetch the stop words list and tokenizer data
stopwords = nltk.corpus.stopwords.words('english')  # English stop words list
 
string_lst = [ '<THE> quick BROWN fox jumping over the lazy dog.', 
               'I am TOO LAZY, to do ANYTHING. Please help', 
               'Padhai Time is there to help you out in anything quick']
 
df = pd.DataFrame(string_lst, columns=['msg'])

def clean_data(text_data):
    print("Original: {}\n".format(text_data))
    
    # Cleaning html tags
    html_pattern = re.compile('<.*?>')
    text_data = re.sub(html_pattern, '', text_data)
    print("Step 1: {}\n".format(text_data))
    
    # Handling Case inconsistencies
    text_data = text_data.lower()
    print("Step 2: {}\n".format(text_data))
    
    # Removing Punctuations
    text_data = re.sub(r'[^\w\s]', '', text_data)
    print("Step 3: {}\n".format(text_data))
    
    # Removing <= 2 letter words
    text_data = ' '.join(word for word in text_data.split() if len(word)>2)
    print("Step 4: {}\n".format(text_data))
    
    tokenized_text = word_tokenize(text_data)
    text_data = " ".join(word for word in tokenized_text if word not in stopwords)
    print("Step 5: {}\n".format(text_data))
    
    porter = PorterStemmer()
    text_data = " ".join(porter.stem(word) for word in word_tokenize(text_data))
    print("Step 6: {}\n".format(text_data))
          
    return text_data


clean_data(df['msg'].iloc[0])

Output:

Original: <THE> quick BROWN fox jumping over the lazy dog.

Step 1: quick BROWN fox jumping over the lazy dog.

Step 2: quick brown fox jumping over the lazy dog.

Step 3: quick brown fox jumping over the lazy dog

Step 4: quick brown fox jumping over the lazy dog

Step 5: quick brown fox jumping lazy dog

Step 6: quick brown fox jump lazi dog

 

 

Now, let us use the clean_data() function to clean all the rows and add a new column named “cleaned_msg” to the dataframe itself:

 

df['cleaned_msg'] = df['msg'].apply(clean_data)

 

[Output: the dataframe now has a “cleaned_msg” column alongside “msg”, containing “quick brown fox jump lazi dog”, “lazi anyth pleas help” and “padhai time help anyth quick”]

 

We now have both the raw and the cleaned messages available. Let us first use the raw ‘msg’ column and get the count for each word.

 

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['msg'])
result = pd.DataFrame(X.toarray())
result.columns = vectorizer.get_feature_names_out()
result

[Output: a 3 × 22 dataframe of word counts, with one column per unique word: am, anything, brown, do, dog, fox, help, in, is, jumping, lazy, out, over, padhai, please, quick, the, there, time, to, too, you]

 

Do you see any problem with this?

If we do not use cleaned data for feature engineering, we end up with far too many columns in the final dataframe. As you can see, just 3 sentences already contain 22 distinct words. With a dataset of 10,000 rows, the vocabulary grows much larger, and CountVectorizer can return tens of thousands of columns, which is not good practice.
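Cleaning is the main remedy, but CountVectorizer itself also offers parameters to cap the vocabulary. A small sketch (the parameter values here are illustrative, not tuned):

from sklearn.feature_extraction.text import CountVectorizer

# stop_words drops common English words, min_df ignores words that
# appear in fewer than 2 documents, and max_features keeps only the
# 1000 most frequent words
capped_vectorizer = CountVectorizer(stop_words='english', min_df=2, max_features=1000)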

 

So this time, we will use the cleaned data for feature creation.

 

vectorizer_clean = CountVectorizer()
X_clean = vectorizer_clean.fit_transform(df['cleaned_msg'])
print(vectorizer_clean.vocabulary_)

 

Output:

{'quick': 9, 'brown': 1, 'fox': 3, 'jump': 5, 'lazi': 6, 'dog': 2, 'anyth': 0, 'pleas': 8, 'help': 4, 'padhai': 7, 'time': 10}
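Note that the integers in vocabulary_ are column indices, not frequencies. A quick check, reusing the fitted objects above:

col = vectorizer_clean.vocabulary_['quick']   # column index 9
print(X_clean[:, col].toarray().ravel())      # counts of 'quick' per row: [1 0 1]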

 

Now we have only 11 words, which will become our features in the final dataframe.

 

result_clean = pd.DataFrame(X_clean.toarray())
result_clean.columns = vectorizer_clean.get_feature_names_out()
result_clean

 

   anyth  brown  dog  fox  help  jump  lazi  padhai  pleas  quick  time
0      0      1    1    1     0     1     1       0      0      1     0
1      1      0    0    0     1     0     1       0      1      0     0
2      1      0    0    0     1     0     0       1      0      1     1

 

Limitations of the Bag of Words Approach:

1) CountVectorizer does not understand the meaning or order of words.

    Take this example:

     - I go to sleep at 10 PM and go to walk at 7 AM

     - I go to walk at 10 PM and go to sleep at 7 AM

     After feature engineering, both of these sentences result in exactly the same column values (see the sketch after this list).

2) The BoW representation contains mostly zeros; such a matrix is called a sparse matrix, and it grows quickly with the vocabulary size.
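A minimal sketch demonstrating both limitations, using the two sentences above:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ['I go to sleep at 10 PM and go to walk at 7 AM',
             'I go to walk at 10 PM and go to sleep at 7 AM']

vec = CountVectorizer()
X = vec.fit_transform(sentences)   # X is stored as a scipy sparse matrix

# Word order is lost: both sentences yield exactly the same feature vector
print((X[0].toarray() == X[1].toarray()).all())   # True

Techniques such as n-grams (CountVectorizer's ngram_range parameter), TF-IDF and word embeddings address these limitations to varying degrees.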

 
