Basic Extractions

From a text corpus, we can extract useful information and turn it into variables (features).


Example: “The quick brown fox jumps over the lazy dog”


From the above sentence, we can extract several meaningful pieces of information, such as:


  • How many words are present?
  • How many characters are present in the sentence?
  • What is the average length of each word?
  • How many lowercase words are present?
  • How many uppercase words are present?
  • What is the length of the longest/shortest word?
  • How many stop words are present?
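
Most of these questions can be answered with built-in string methods alone. The snippet below is a minimal sketch that works through the example sentence. Deciding whether a word is lowercase or uppercase with `str.islower()` and `str.isupper()` is an assumption about how those counts are meant to be defined.

```python
sentence = "The quick brown fox jumps over the lazy dog"

words = sentence.split()                      # split on whitespace
word_lengths = [len(w) for w in words]

num_words = len(words)
num_chars = len(sentence)                     # spaces are counted too
avg_word_len = round(sum(word_lengths) / num_words, 1)
num_lower = sum(w.islower() for w in words)   # "The" is mixed case, so it is not counted
num_upper = sum(w.isupper() for w in words)
longest, shortest = max(word_lengths), min(word_lengths)

print(num_words, num_chars, avg_word_len, num_lower, num_upper, longest, shortest)
```

For this sentence the script prints 9 43 3.9 8 0 5 3.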


Counting stop words requires a stop word list, which we take from NLTK. The other features need only built-in Python string methods.
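
Stop word counting comes down to a case-insensitive membership test against a stop word list. The sketch below uses a tiny hand-written set, `toy_stop_words`, as a hypothetical stand-in for the NLTK list so that it runs without any download. The real list is much larger, and the helper name `count_stop_words` is also just illustrative.

```python
# Hypothetical stand-in for NLTK's English stop word list (assumption: illustration only)
toy_stop_words = {"the", "over", "a", "an", "is", "to"}

def count_stop_words(sentence, stop_words):
    """Count words whose lowercase form appears in stop_words."""
    return sum(word.lower() in stop_words for word in sentence.split())

count = count_stop_words("The quick brown fox jumps over the lazy dog", toy_stop_words)
print(count)  # 3: "The", "over", "the"
```

Lowercasing each word before the lookup is what lets "The" match the entry "the".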



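import pandas as pd
import nltk
nltk.download('stopwords', quiet=True)  # fetch the stop word corpus if it is not already present
stopwords = set(nltk.corpus.stopwords.words('english'))  # set gives fast membership tests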
string_lst = ['THE quick BROWN fox jumps over the lazy dog',
              'I am TOO LAZY to do ANYTHING',
              'Padhai Time is there to help you out']
df = pd.DataFrame(string_lst, columns=['msg'])
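def derive_features(message):
    """Return basic text features for a single message string."""
    words_lst = message.split()
    num_characters = len(message)  # includes spaces
    num_words = len(words_lst)
    if not words_lst:
        # No words: avoid dividing by zero or calling max()/min() on an empty list
        return num_characters, 0, 0.0, 0, 0, 0, 0, 0
    words_length = []
    lower_words_lst = []
    upper_words_lst = []
    is_stop_word_lst = []
    for word in words_lst:
        words_length.append(len(word))
        lower_words_lst.append(word.islower())   # True only if the whole word is lowercase
        upper_words_lst.append(word.isupper())   # True only if the whole word is uppercase
        is_stop_word_lst.append(word.lower() in stopwords)
    stop_words_count = sum(is_stop_word_lst)
    avg_word_length = round(sum(words_length) / len(words_length), 1)
    max_length_word = max(words_length)
    min_length_word = min(words_length)
    total_lower_words = sum(lower_words_lst)
    total_upper_words = sum(upper_words_lst)
    return (num_characters, num_words, avg_word_length, total_lower_words,
            total_upper_words, max_length_word, min_length_word, stop_words_count)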

# Apply the function to every message and unpack the result tuples into one column per feature
(df['num_chars'], df['num_words'], df['avg_word_len'], df['num_lower_words'],
 df['num_upper_words'], df['max_length_word'], df['min_length_word'],
 df['num_stop_words']) = zip(*df['msg'].apply(derive_features))
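
The final assignment relies on `zip(*...)` to turn a sequence of per-row tuples into one tuple per feature. Here is a minimal sketch of that transposition in plain Python. The values in `rows` are made up for illustration and are not real output.

```python
# Each tuple stands for one row's (num_chars, num_words); values are illustrative only
rows = [(43, 9), (28, 7), (36, 8)]

# Unpacking with * passes each row as a separate argument, so zip pairs up
# the first elements, then the second elements, and so on
num_chars_col, num_words_col = zip(*rows)

print(num_chars_col)  # (43, 28, 36)
print(num_words_col)  # (9, 7, 8)
```

Each resulting tuple has one entry per row, which is why it can be assigned directly to a DataFrame column.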


