So far we have read about Bag of Words, which focuses only on the frequency of a word in a sentence. Consider the scenarios below, where Bag of Words is not a good approach to use:
1) Suppose we do not want to remove stop words from the text corpus. In this case, the frequency of words like “is”, “the”, and “a” will be very high, yet these words carry little meaning for a model to learn from.
2) Suppose we are processing product reviews from Amazon / Flipkart. Terms like “product” and “item” are domain dependent and appear in almost every review, so they will not help the model learn anything either.
3) Similarly, in a mobile phone dataset the keyword “mobile” adds no value, whereas keywords like “5 GB”, “Splash Proof”, and “Android OS” are far more informative.
Hence there is a technique called TF-IDF (Term Frequency - Inverse Document Frequency), in which frequently occurring words are suppressed (given lower importance) and rarer, more unique words are given higher weightage in the sentence.
Let us look at the two sentences below:
- I do not like Vanilla Cake
- I do not like Vanilla Icecream
In both of the above sentences, the important keywords to notice are “Cake” and “Icecream”. The remaining terms, “I”, “do”, “not”, “like”, and “Vanilla”, are repeated in both sentences, so they do not provide any useful information to the model.
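To see the problem concretely, here is a quick Bag of Words sketch using scikit-learn’s CountVectorizer: every word that survives tokenization gets the same count of 1 in both sentences, so “Cake” and “Icecream” look no more important than “Vanilla”.
from sklearn.feature_extraction.text import CountVectorizer

docs = ['I do not like Vanilla Cake',
        'I do not like Vanilla Icecream']
bow = CountVectorizer()              # the default token pattern drops 1-letter words like "I"
X = bow.fit_transform(docs)
print(bow.get_feature_names_out())   # ['cake' 'do' 'icecream' 'like' 'not' 'vanilla']
print(X.toarray())
# [[1 1 0 1 1 1]
#  [1 0 1 1 1 1]]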
Terminology:
Corpus: The entire text data given to us. A corpus can contain many documents.
Document: A single sentence (or unit of text) inside the corpus.
Term: A single word inside a document / sentence.
TF: Term Frequency
IDF: Inverse Document Frequency
Formulas:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
IDF(t) = log(total number of documents / number of documents containing term t)
TF-IDF(t, d) = TF(t, d) * IDF(t)
Let us take the same example and calculate the Term Frequency and IDF values:
Document A: I do not like Vanilla Cake
Document B: I do not like Vanilla Icecream
No. of words in Document A: 6
No. of words in Document B: 6
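Plugging these counts into the formulas above (using the natural log here; the choice of log base only scales the IDF values, it does not change their ranking):

Term      TF (Doc A)  TF (Doc B)  IDF               TF-IDF (Doc A)  TF-IDF (Doc B)
I         1/6         1/6         log(2/2) = 0      0               0
do        1/6         1/6         log(2/2) = 0      0               0
not       1/6         1/6         log(2/2) = 0      0               0
like      1/6         1/6         log(2/2) = 0      0               0
Vanilla   1/6         1/6         log(2/2) = 0      0               0
Cake      1/6         0           log(2/1) ≈ 0.693  ≈ 0.116         0
Icecream  0           1/6         log(2/1) ≈ 0.693  0               ≈ 0.116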
It is clear from the calculation above that less frequent words like “Cake” and “Icecream” get more weight than words that appear in every document.
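Before moving on to sklearn, here is a minimal from-scratch sketch of the same calculation (the function names tf and idf are mine, just for illustration, and the natural log is assumed):
import math

docs = ['I do not like Vanilla Cake'.lower().split(),
        'I do not like Vanilla Icecream'.lower().split()]

def tf(term, doc):
    # fraction of the document's terms that match `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (number of documents / number of documents containing the term)
    doc_freq = sum(term in doc for doc in docs)
    return math.log(len(docs) / doc_freq)

for term in ['vanilla', 'cake', 'icecream']:
    print(term, [round(tf(term, doc) * idf(term, docs), 3) for doc in docs])
# vanilla [0.0, 0.0]
# cake [0.116, 0.0]
# icecream [0.0, 0.116]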
We can achieve the same task by importing TfidfVectorizer from the scikit-learn library.
One thing to note is that every library that calculates TF-IDF may use a slightly different formula, and there are parameters you can set for smoothing the results. So if the TF-IDF values from sklearn differ from the hand calculation above, do not get confused; the basic idea behind the approach is the same.
Some implementations add 1 in the denominator (and elsewhere) while calculating the IDF values, to avoid division by zero and to soften extreme values.
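scikit-learn’s TfidfVectorizer is one example: with smooth_idf=True (the default) it computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, while smooth_idf=False gives idf(t) = ln(n / df(t)) + 1. A quick sketch on two already-cleaned documents (assumed here just for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['like vanilla cake', 'like vanilla icecream']   # assumed pre-cleaned documents
for smooth in (True, False):
    vec = TfidfVectorizer(smooth_idf=smooth).fit(docs)
    print(smooth, dict(zip(vec.get_feature_names_out(), vec.idf_.round(4))))
# True  -> cake/icecream ≈ 1.4055, like/vanilla = 1.0
# False -> cake/icecream ≈ 1.6931, like/vanilla = 1.0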
Code:
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')   # needed for the stop words list
nltk.download('punkt')       # needed by word_tokenize

stopwords = nltk.corpus.stopwords.words('english')  # importing stop words list

string_lst = ['I do not like Vanilla Cake',
              'I do not like Vanilla Icecream']
df = pd.DataFrame(string_lst, columns=['msg'])

def clean_data(text_data):
    print("Original: {}\n".format(text_data))
    # Cleaning html tags
    html_pattern = re.compile('<.*?>')
    text_data = re.sub(html_pattern, '', text_data)
    print("Step 1: {}\n".format(text_data))
    # Handling case inconsistencies
    text_data = text_data.lower()
    print("Step 2: {}\n".format(text_data))
    # Removing punctuation
    text_data = re.sub(r'[^\w\s]', '', text_data)
    print("Step 3: {}\n".format(text_data))
    # Removing words of <= 2 letters
    text_data = ' '.join(word for word in text_data.split() if len(word) > 2)
    print("Step 4: {}\n".format(text_data))
    # Removing stop words
    tokenized_text = word_tokenize(text_data)
    text_data = " ".join(word for word in tokenized_text if word not in stopwords)
    print("Step 5: {}\n".format(text_data))
    # Stemming with the Porter stemmer
    porter = PorterStemmer()
    text_data = " ".join(porter.stem(word) for word in word_tokenize(text_data))
    print("Step 6: {}\n".format(text_data))
    return text_data

df['cleaned_msg'] = df['msg'].apply(clean_data)
Output:
Original: I do not like Vanilla Cake

Step 1: I do not like Vanilla Cake

Step 2: i do not like vanilla cake

Step 3: i do not like vanilla cake

Step 4: not like vanilla cake

Step 5: like vanilla cake

Step 6: like vanilla cake

(The second document goes through the same steps and is cleaned to “like vanilla icecream”.)
vectorizer_clean = TfidfVectorizer(smooth_idf=True)
X_clean = vectorizer_clean.fit_transform(df['cleaned_msg'])
print(vectorizer_clean.vocabulary_)
Output:
{'like': 2, 'vanilla': 3, 'cake': 0, 'icecream': 1}
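Note that vocabulary_ maps each term to its column index in the resulting matrix (the terms are indexed in alphabetical order); the values are positions, not frequencies.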
terms = vectorizer_clean.get_feature_names_out()   # get_feature_names() was removed in newer scikit-learn versions
terms
Output:
array(['cake', 'icecream', 'like', 'vanilla'], dtype=object)
idf_values = vectorizer_clean.idf_
print("IDF Values: \n", {terms[i]: idf_values[i] for i in range(len(terms))})
Output:
IDF Values:
{'cake': 1.4054651081081644, 'icecream': 1.4054651081081644, 'like': 1.0, 'vanilla': 1.0}
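These values match the smoothed formula mentioned above. With n = 2 documents, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which we can verify directly:
import math
print(1 + math.log((1 + 2) / (1 + 1)))   # 1.4054651081081644 -> 'cake' and 'icecream' (df = 1)
print(1 + math.log((1 + 2) / (1 + 2)))   # 1.0 -> 'like' and 'vanilla' (df = 2)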
result_clean = pd.DataFrame(X_clean.toarray())
result_clean.columns = vectorizer_clean.get_feature_names_out()
result_clean
Output:
       cake  icecream      like   vanilla
0  0.704909  0.000000  0.501549  0.501549
1  0.000000  0.704909  0.501549  0.501549
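Note that these are not the raw tf * idf products: by default TfidfVectorizer also applies L2 normalization to every row (norm='l2'), so each document vector has unit length. A quick check for the first document:
import numpy as np

row = np.array([1.4054651081081644, 1.0, 1.0])   # raw tf*idf for 'cake', 'like', 'vanilla'
print(row / np.linalg.norm(row))                 # ≈ [0.704909 0.501549 0.501549]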
Now you should have a good sense of why TF-IDF vectorization works better than Bag of Words: the distinguishing terms “cake” and “icecream” receive the highest weights, while terms shared by every document are suppressed.