So far we have read about Bag of Words, which focuses only on the frequency of a word in a sentence. Consider the scenarios below, where Bag of Words is not a good approach to use:
1) Suppose we do not want to remove stop words from the text corpus. In this case, the frequency of words like “is”, “the”, and “a” will be very high, yet these words carry little meaning for a model to learn from.
2) Suppose we are processing product reviews from Amazon / Flipkart. Terms like “product” and “item” are domain dependent and appear in almost every review, so they will not help the model learn anything either.
3) Similarly, in a mobile phone dataset the keyword “mobile” adds no value, whereas keywords like “5 GB”, “Splash Proof”, and “Android OS” are far more informative.
Hence there is a technique called TF-IDF (Term Frequency - Inverse Document Frequency), in which frequently occurring words are suppressed (given lower importance) and rarer, more unique words are given higher weightage in the sentence.
Let us look at the two sentences below:
- I do not like Vanilla Cake
- I do not like Vanilla Icecream
In both of the above sentences, the important keywords to notice are “Cake” and “Icecream”. The remaining terms, “I”, “do”, “not”, “like”, and “Vanilla”, are repeated in both sentences, so they do not provide any useful information to the model.
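To see the problem concretely, here is a quick Bag of Words sketch using scikit-learn’s CountVectorizer: every word that survives tokenization gets the same count of 1 in both sentences, so “Cake” and “Icecream” look no more important than “Vanilla”.
from sklearn.feature_extraction.text import CountVectorizer

docs = ['I do not like Vanilla Cake',
        'I do not like Vanilla Icecream']
bow = CountVectorizer()              # the default token pattern drops 1-letter words like "I"
X = bow.fit_transform(docs)
print(bow.get_feature_names_out())   # ['cake' 'do' 'icecream' 'like' 'not' 'vanilla']
print(X.toarray())
# [[1 1 0 1 1 1]
#  [1 0 1 1 1 1]]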
Terminology:
Corpus: The entire text data given to us. A corpus can contain many documents.
Document: A single sentence (or unit of text) inside the corpus.
Term: A single word inside a document / sentence.
TF: Term Frequency
IDF: Inverse Document Frequency
Formulas:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
IDF(t) = log(total number of documents / number of documents containing term t)
TF-IDF(t, d) = TF(t, d) * IDF(t)
Let us take the same example and calculate the Term Frequency and IDF values:
Document A: I do not like Vanilla Cake
Document B: I do not like Vanilla Icecream
No. of words in Document A: 6
No. of words in Document B: 6
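Plugging these counts into the formulas above (using the natural log here; the choice of log base only scales the IDF values, it does not change their ranking):

Term      TF (Doc A)  TF (Doc B)  IDF               TF-IDF (Doc A)  TF-IDF (Doc B)
I         1/6         1/6         log(2/2) = 0      0               0
do        1/6         1/6         log(2/2) = 0      0               0
not       1/6         1/6         log(2/2) = 0      0               0
like      1/6         1/6         log(2/2) = 0      0               0
Vanilla   1/6         1/6         log(2/2) = 0      0               0
Cake      1/6         0           log(2/1) ≈ 0.693  ≈ 0.116         0
Icecream  0           1/6         log(2/1) ≈ 0.693  0               ≈ 0.116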
It is clear from the calculation above that less frequent words like “Cake” and “Icecream” get more weight than words that appear in every document.
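Before moving on to sklearn, here is a minimal from-scratch sketch of the same calculation (the function names tf and idf are mine, just for illustration, and the natural log is assumed):
import math

docs = ['I do not like Vanilla Cake'.lower().split(),
        'I do not like Vanilla Icecream'.lower().split()]

def tf(term, doc):
    # fraction of the document's terms that match `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (number of documents / number of documents containing the term)
    doc_freq = sum(term in doc for doc in docs)
    return math.log(len(docs) / doc_freq)

for term in ['vanilla', 'cake', 'icecream']:
    print(term, [round(tf(term, doc) * idf(term, docs), 3) for doc in docs])
# vanilla [0.0, 0.0]
# cake [0.116, 0.0]
# icecream [0.0, 0.116]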
We can achieve the same task by importing TfidfVectorizer from the scikit-learn library.
One thing to note is that every library that calculates TF-IDF may use a slightly different formula, and there are parameters you can set for smoothing the results. So if the TF-IDF values from sklearn differ from the hand calculation above, do not get confused; the basic idea behind the approach is the same.
Some implementations add 1 in the denominator (and elsewhere) while calculating the IDF values, to avoid division by zero and to soften extreme values.
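scikit-learn’s TfidfVectorizer is one example: with smooth_idf=True (the default) it computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, while smooth_idf=False gives idf(t) = ln(n / df(t)) + 1. A quick sketch on two already-cleaned documents (assumed here just for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['like vanilla cake', 'like vanilla icecream']   # assumed pre-cleaned documents
for smooth in (True, False):
    vec = TfidfVectorizer(smooth_idf=smooth).fit(docs)
    print(smooth, dict(zip(vec.get_feature_names_out(), vec.idf_.round(4))))
# True  -> cake/icecream ≈ 1.4055, like/vanilla = 1.0
# False -> cake/icecream ≈ 1.6931, like/vanilla = 1.0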
Code:
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')   # needed for the stop words list
nltk.download('punkt')       # needed by word_tokenize

stopwords = nltk.corpus.stopwords.words('english')  # importing stop words list

string_lst = ['I do not like Vanilla Cake',
              'I do not like Vanilla Icecream']
df = pd.DataFrame(string_lst, columns=['msg'])

def clean_data(text_data):
    print("Original: {}\n".format(text_data))
    # Cleaning html tags
    html_pattern = re.compile('<.*?>')
    text_data = re.sub(html_pattern, '', text_data)
    print("Step 1: {}\n".format(text_data))
    # Handling case inconsistencies
    text_data = text_data.lower()
    print("Step 2: {}\n".format(text_data))
    # Removing punctuation
    text_data = re.sub(r'[^\w\s]', '', text_data)
    print("Step 3: {}\n".format(text_data))
    # Removing words of <= 2 letters
    text_data = ' '.join(word for word in text_data.split() if len(word) > 2)
    print("Step 4: {}\n".format(text_data))
    # Removing stop words
    tokenized_text = word_tokenize(text_data)
    text_data = " ".join(word for word in tokenized_text if word not in stopwords)
    print("Step 5: {}\n".format(text_data))
    # Stemming with the Porter stemmer
    porter = PorterStemmer()
    text_data = " ".join(porter.stem(word) for word in word_tokenize(text_data))
    print("Step 6: {}\n".format(text_data))
    return text_data

df['cleaned_msg'] = df['msg'].apply(clean_data)
Output:
Original: I do not like Vanilla Cake

Step 1: I do not like Vanilla Cake

Step 2: i do not like vanilla cake

Step 3: i do not like vanilla cake

Step 4: not like vanilla cake

Step 5: like vanilla cake

Step 6: like vanilla cake

(The second document goes through the same steps and is cleaned to “like vanilla icecream”.)
vectorizer_clean = TfidfVectorizer(smooth_idf=True)
X_clean = vectorizer_clean.fit_transform(df['cleaned_msg'])
print(vectorizer_clean.vocabulary_)
Output:
{'like': 2, 'vanilla': 3, 'cake': 0, 'icecream': 1}
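Note that vocabulary_ maps each term to its column index in the resulting matrix (the terms are indexed in alphabetical order); the values are positions, not frequencies.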
terms = vectorizer_clean.get_feature_names_out()   # get_feature_names() was removed in newer scikit-learn versions
terms
Output:
array(['cake', 'icecream', 'like', 'vanilla'], dtype=object)
idf_values = vectorizer_clean.idf_
print("IDF Values: \n", {terms[i]: idf_values[i] for i in range(len(terms))})
Output:
IDF Values:
{'cake': 1.4054651081081644, 'icecream': 1.4054651081081644, 'like': 1.0, 'vanilla': 1.0}
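These values match the smoothed formula mentioned above. With n = 2 documents, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which we can verify directly:
import math
print(1 + math.log((1 + 2) / (1 + 1)))   # 1.4054651081081644 -> 'cake' and 'icecream' (df = 1)
print(1 + math.log((1 + 2) / (1 + 2)))   # 1.0 -> 'like' and 'vanilla' (df = 2)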
result_clean = pd.DataFrame(X_clean.toarray())
result_clean.columns = vectorizer_clean.get_feature_names_out()
result_clean
Output:
       cake  icecream      like   vanilla
0  0.704909  0.000000  0.501549  0.501549
1  0.000000  0.704909  0.501549  0.501549
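Note that these are not the raw tf * idf products: by default TfidfVectorizer also applies L2 normalization to every row (norm='l2'), so each document vector has unit length. A quick check for the first document:
import numpy as np

row = np.array([1.4054651081081644, 1.0, 1.0])   # raw tf*idf for 'cake', 'like', 'vanilla'
print(row / np.linalg.norm(row))                 # ≈ [0.704909 0.501549 0.501549]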
Now you should have a good sense of why TF-IDF vectorization works better than Bag of Words: the distinguishing terms “cake” and “icecream” receive the highest weights, while terms shared by every document are suppressed.