
Bag of Words

Bag of Words (BoW) is a technique for extracting features from text data. It represents each sentence by counting the occurrences of each word: every time a word appears in the sentence, its count increases by 1; words that never appear keep a count of 0.

 

This technique is simple and easy to implement, but it comes with its own limitations, which we will discuss at the end of this article.
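Before diving into the full pipeline, here is a minimal sketch of the counting idea in plain Python (the vocabulary and sentence below are made up for illustration):

from collections import Counter

vocabulary = ['quick', 'brown', 'fox', 'lazy', 'dog']
sentence = 'the quick brown fox jumps over the lazy dog'

# Count every word in the sentence, then read off the counts
# for the words in our vocabulary
counts = Counter(sentence.split())
features = [counts[word] for word in vocabulary]
print(features)   # [1, 1, 1, 1, 1]

Libraries such as scikit-learn automate exactly this counting, as we will see below.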

 

How to Use the Bag of Words Technique?

When we have text data available, it is a prerequisite to clean the text first.

We will apply all the cleaning steps discussed in the previous chapter. Once the text is clean, we will use the Bag of Words technique to extract features from it.

  

Code:

import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Note: nltk.download('stopwords') and nltk.download('punkt') may be
# required on the first run to fetch the stop words list and tokenizer data
stopwords = nltk.corpus.stopwords.words('english')  # English stop words list
 
string_lst = [ '<THE> quick BROWN fox jumping over the lazy dog.', 
               'I am TOO LAZY, to do ANYTHING. Please help', 
               'Padhai Time is there to help you out in anything quick']
 
df = pd.DataFrame(string_lst, columns=['msg'])

def clean_data(text_data):
    print("Original: {}\n".format(text_data))
    
    # Cleaning html tags
    html_pattern = re.compile('<.*?>')
    text_data = re.sub(html_pattern, '', text_data)
    print("Step 1: {}\n".format(text_data))
    
    # Handling Case inconsistencies
    text_data = text_data.lower()
    print("Step 2: {}\n".format(text_data))
    
    # Removing Punctuations
    text_data = re.sub(r'[^\w\s]', '', text_data)
    print("Step 3: {}\n".format(text_data))
    
    # Removing <= 2 letter words
    text_data = ' '.join(word for word in text_data.split() if len(word)>2)
    print("Step 4: {}\n".format(text_data))
    
    tokenized_text = word_tokenize(text_data)
    text_data = " ".join(word for word in tokenized_text if word not in stopwords)
    print("Step 5: {}\n".format(text_data))
    
    porter = PorterStemmer()
    text_data = " ".join(porter.stem(word) for word in word_tokenize(text_data))
    print("Step 6: {}\n".format(text_data))
          
    return text_data


clean_data(df['msg'].iloc[0])

Output:

Original: <THE> quick BROWN fox jumping over the lazy dog.

Step 1: quick BROWN fox jumping over the lazy dog.

Step 2: quick brown fox jumping over the lazy dog.

Step 3: quick brown fox jumping over the lazy dog

Step 4: quick brown fox jumping over the lazy dog

Step 5: quick brown fox jumping lazy dog

Step 6: quick brown fox jump lazi dog

 

 

Now, let us use the clean_data() function to clean all the rows and add a new column named “cleaned_msg” to the dataframe itself:

 

df['cleaned_msg'] = df['msg'].apply(clean_data)

 

[Output: the dataframe now has a “cleaned_msg” column alongside “msg”, containing “quick brown fox jump lazi dog”, “lazi anyth pleas help” and “padhai time help anyth quick”]

 

We now have both the raw and the cleaned messages available. Let us first use the raw ‘msg’ column and get the count for each word.

 

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['msg'])
result = pd.DataFrame(X.toarray())
result.columns = vectorizer.get_feature_names_out()
result

[Output: a 3 × 22 dataframe of word counts, with one column per unique word: am, anything, brown, do, dog, fox, help, in, is, jumping, lazy, out, over, padhai, please, quick, the, there, time, to, too, you]

 

Do you see any problem with this?

If we do not use cleaned data for feature engineering, we end up with far too many columns in the final dataframe. As you can see, just 3 sentences already contain 22 distinct words. With a dataset of 10,000 rows, the vocabulary grows much larger, and CountVectorizer can return tens of thousands of columns, which is not good practice.
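Cleaning is the main remedy, but CountVectorizer itself also offers parameters to cap the vocabulary. A small sketch (the parameter values here are illustrative, not tuned):

from sklearn.feature_extraction.text import CountVectorizer

# stop_words drops common English words, min_df ignores words that
# appear in fewer than 2 documents, and max_features keeps only the
# 1000 most frequent words
capped_vectorizer = CountVectorizer(stop_words='english', min_df=2, max_features=1000)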

 

So this time, we will use the cleaned data for feature creation.

 

vectorizer_clean = CountVectorizer()
X_clean = vectorizer_clean.fit_transform(df['cleaned_msg'])
print(vectorizer_clean.vocabulary_)

 

Output:

{'quick': 9, 'brown': 1, 'fox': 3, 'jump': 5, 'lazi': 6, 'dog': 2, 'anyth': 0, 'pleas': 8, 'help': 4, 'padhai': 7, 'time': 10}
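Note that the integers in vocabulary_ are column indices, not frequencies. A quick check, reusing the fitted objects above:

col = vectorizer_clean.vocabulary_['quick']   # column index 9
print(X_clean[:, col].toarray().ravel())      # counts of 'quick' per row: [1 0 1]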

 

Now we have only 11 words, which will become our features in the final dataframe.

 

result_clean = pd.DataFrame(X_clean.toarray())
result_clean.columns = vectorizer_clean.get_feature_names_out()
result_clean

 

   anyth  brown  dog  fox  help  jump  lazi  padhai  pleas  quick  time
0      0      1    1    1     0     1     1       0      0      1     0
1      1      0    0    0     1     0     1       0      1      0     0
2      1      0    0    0     1     0     0       1      0      1     1

 

Limitations of the Bag of Words Approach:

1) CountVectorizer does not understand the meaning or order of words.

    Take this example:

     - I go to sleep at 10 PM and go to walk at 7 AM

     - I go to walk at 10 PM and go to sleep at 7 AM

     After feature engineering, both of these sentences result in exactly the same column values (see the sketch after this list).

2) The BoW representation contains mostly zeros; such a matrix is called a sparse matrix, and it grows quickly with the vocabulary size.
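A minimal sketch demonstrating both limitations, using the two sentences above:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ['I go to sleep at 10 PM and go to walk at 7 AM',
             'I go to walk at 10 PM and go to sleep at 7 AM']

vec = CountVectorizer()
X = vec.fit_transform(sentences)   # X is stored as a scipy sparse matrix

# Word order is lost: both sentences yield exactly the same feature vector
print((X[0].toarray() == X[1].toarray()).all())   # True

Techniques such as n-grams (CountVectorizer's ngram_range parameter), TF-IDF and word embeddings address these limitations to varying degrees.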

 
