
Term Frequency - Inverse Document Frequency

So far we have read about Bag of Words, which focuses only on the frequency of a word in a sentence. Consider the scenarios below, where Bag of Words is not a good approach to use:

 

1) Suppose we don’t want to remove stop words from the text corpus. In this case, the frequency of “is”, “the”, “a” will be very high, yet these words carry no meaning from which a model can learn anything.

2) Suppose we are processing product reviews on Amazon / Flipkart. Terms like “product” and “item” are domain dependent and appear in almost every review, so they will not help the model learn anything.

3) In a data set of phone reviews, the keyword “mobile” adds no value, while keywords like “5 GB”, “Splash Proof” and “Android OS” are far more informative.

 

Hence there is a technique called TF-IDF, in which frequently occurring words are suppressed (given lower importance) and rarer, more distinctive words are given a higher weightage in the sentence.

 

Let us look at the two sentences below:

- I do not like Vanilla Cake

- I do not like Vanilla Icecream

   

In both of the above sentences, the important keywords to notice are “Cake” and “Icecream”. The remaining terms, “I”, “do”, “not”, “like” and “Vanilla”, are repeated in both sentences, hence they do not provide any useful information to the model.

  

Terminology:

 

Corpus: The entire text data given to us. A corpus can contain many documents.

Document: A single sentence inside the text corpus.

Term: A single word inside a document / sentence.

TF: Term Frequency

IDF: Inverse Document Frequency

 

Formulas:

TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)

IDF(t) = log(Total number of documents / Number of documents containing term t)

TF-IDF(t, d) = TF(t, d) × IDF(t)

  

Let us take the same example and calculate Term Frequency and IDF:

 

Document A: I do not like Vanilla Cake

Document B: I do not like Vanilla Icecream

 

No. of words in Document A: 6

No. of words in Document B: 6

  

Term Frequency (TF) for each term:

Term        TF in Document A    TF in Document B
I           1/6                 1/6
do          1/6                 1/6
not         1/6                 1/6
like        1/6                 1/6
Vanilla     1/6                 1/6
Cake        1/6                 0
Icecream    0                   1/6

 

Inverse Document Frequency (IDF), using log base 10:

Term        Documents containing the term    IDF
I           2                                log(2/2) = 0
do          2                                log(2/2) = 0
not         2                                log(2/2) = 0
like        2                                log(2/2) = 0
Vanilla     2                                log(2/2) = 0
Cake        1                                log(2/1) ≈ 0.301
Icecream    1                                log(2/1) ≈ 0.301

Multiplying TF by IDF, every term shared by both documents scores 0, while “Cake” and “Icecream” each score 1/6 × 0.301 ≈ 0.05 in the one document where they appear.

 

It is clear from the above approach that less frequent words like ‘cake’ and ‘icecream’ get more weight than more frequent words.
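
The following minimal sketch reproduces this hand calculation in plain Python, using exactly the formulas defined above (the function and variable names here are illustrative, not part of any library):

import math

# The two example documents, tokenized into lowercase terms
doc_a = "I do not like Vanilla Cake".lower().split()
doc_b = "I do not like Vanilla Icecream".lower().split()
corpus = [doc_a, doc_b]

def tf(term, doc):
    # Term Frequency: occurrences of the term / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse Document Frequency: log10(total documents / documents containing the term)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / docs_with_term)

for term in ["like", "cake", "icecream"]:
    print(term,
          "TF(A):", round(tf(term, doc_a), 3),
          "TF(B):", round(tf(term, doc_b), 3),
          "IDF:", round(idf(term, corpus), 3))
# like      TF(A): 0.167  TF(B): 0.167  IDF: 0.0
# cake      TF(A): 0.167  TF(B): 0.0    IDF: 0.301
# icecream  TF(A): 0.0    TF(B): 0.167  IDF: 0.301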

We can achieve the same task by importing TfidfVectorizer from the sklearn library.

 

One thing to note is that every library that calculates TF-IDF may use a slightly different formula, and there are parameters you can set to smooth the results. So when you see a different TF-IDF value from sklearn, do not get confused; the basic idea behind the approach remains the same.

For example, some variants add 1 to the numerator and denominator while calculating IDF (to avoid division by zero), and some add 1 to the final IDF value so that terms appearing in every document are not ignored entirely.
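
As a concrete illustration: sklearn’s TfidfVectorizer with smooth_idf=True (used in the code below, and also its default) computes the smoothed IDF as ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A quick check for our two-document example:

import math

n = 2          # total number of documents
df_cake = 1    # "cake" appears in only one document
df_like = 2    # "like" appears in both documents

def smooth_idf(df, n):
    # sklearn's smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df)) + 1

print(smooth_idf(df_cake, n))   # ~1.405 (matches the sklearn IDF output further down)
print(smooth_idf(df_like, n))   # 1.0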

 

Code:

import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# If running for the first time, download the NLTK resources used below:
# nltk.download('stopwords')
# nltk.download('punkt')

stopwords = nltk.corpus.stopwords.words('english') # importing stop words list
   
string_lst = [ 'I do not like Vanilla Cake', 
               'I do not like Vanilla Icecream']
   
df = pd.DataFrame(string_lst, columns=['msg'])
  
def clean_data(text_data):
    print("Original: {}\n".format(text_data))
     
    # Cleaning html tags
    html_pattern = re.compile('<.*?>')
    text_data = re.sub(html_pattern, '', text_data)
    print("Step 1: {}\n".format(text_data))
      
    # Handling Case inconsistencies
    text_data = text_data.lower()
    print("Step 2: {}\n".format(text_data))
      
    # Removing Punctuations
    text_data = re.sub(r'[^\w\s]', '', text_data)
    print("Step 3: {}\n".format(text_data))
     
    # Removing <= 2 letter words
    text_data = ' '.join(word for word in text_data.split() if len(word)>2)
    print("Step 4: {}\n".format(text_data))
      
    tokenized_text = word_tokenize(text_data)
    text_data = " ".join(word for word in tokenized_text if word not in stopwords)
    print("Step 5: {}\n".format(text_data))
      
    porter = PorterStemmer()
    text_data = " ".join(porter.stem(word) for word in word_tokenize(text_data))
    print("Step 6: {}\n".format(text_data))
           
    return text_data
  
df['cleaned_msg'] = df['msg'].apply(clean_data)

 

Output:

(The function prints the text after each of the six cleaning steps; the final cleaned messages are “like vanilla cake” and “like vanilla icecream”.)

 

vectorizer_clean = TfidfVectorizer(smooth_idf=True)
X_clean = vectorizer_clean.fit_transform(df['cleaned_msg'])
print(vectorizer_clean.vocabulary_)

Output:

 {'like': 2, 'vanilla': 3, 'cake': 0, 'icecream': 1} 

terms = list(vectorizer_clean.get_feature_names_out())   # get_feature_names() in older sklearn versions
terms

Output:

['cake', 'icecream', 'like', 'vanilla'] 

idf_values = vectorizer_clean.idf_
print("IDF Values: \n", {terms[i]: idf_values[i] for i in range(len(terms))})

Output:

IDF Values:
{'cake': 1.4054651081081644, 'icecream': 1.4054651081081644, 'like': 1.0, 'vanilla': 1.0}
 

result_clean = pd.DataFrame(X_clean.toarray())
result_clean.columns = vectorizer_clean.get_feature_names_out()
result_clean

Output:

(A 2 × 4 DataFrame of TF-IDF weights, one row per document and one column per term: “cake” gets the highest weight in row 0, “icecream” the highest in row 1, while the shared terms “like” and “vanilla” get lower weights in both rows.)
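
To connect this matrix back to the IDF values above: sklearn uses the raw term count as TF, multiplies it by the smoothed IDF, and then normalizes each row to unit length (its default norm='l2'). A minimal sketch of that last step for Document A (“like vanilla cake”, where every term count is 1), reusing the IDF values printed earlier:

import math

idf = {'cake': 1.4054651081081644, 'like': 1.0, 'vanilla': 1.0}

# Raw TF-IDF scores: term count (1 for each term here) multiplied by IDF
raw = {term: 1 * value for term, value in idf.items()}

# L2-normalize the row so that its squared values sum to 1
norm = math.sqrt(sum(v ** 2 for v in raw.values()))
weights = {term: round(v / norm, 4) for term, v in raw.items()}

print(weights)   # {'cake': 0.7049, 'like': 0.5015, 'vanilla': 0.5015}

This is why “cake” and “icecream” end up with the highest weights in their respective rows.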

 

By now you should have a good sense of why TF-IDF vectorization works better than Bag of Words.
