Padhai Time

Basic Extractions

From a text corpus, we can extract useful information and turn it into variables (features).

 

Example: “The quick brown fox jumps over the lazy dog”

 

From the above sentence, we can extract several pieces of meaningful information, such as:

 

  • How many words are present?
  • How many characters are present in the sentence?
  • What is the average length of each word?
  • How many lowercase words are present?
  • How many uppercase words are present?
  • What is the length of the longest/shortest word?
  • How many stop words are present?
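
As a quick illustration, most of these questions can be answered for the example sentence with nothing but built-in string methods. This is only a minimal sketch, and the variable names are chosen for illustration.

```python
sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.split()  # split on whitespace

num_words = len(words)
num_chars = len(sentence)  # spaces are counted too
word_lengths = [len(w) for w in words]
avg_word_len = round(sum(word_lengths) / num_words, 1)
num_lower = sum(w.islower() for w in words)  # "The" is not all-lowercase
num_upper = sum(w.isupper() for w in words)
longest, shortest = max(word_lengths), min(word_lengths)

print(num_words, num_chars, avg_word_len, num_lower, num_upper, longest, shortest)
# 9 43 3.9 8 0 5 3
```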

 

For stop words, we will use NLTK's English stop-word list; the other variables need only built-in Python string operations.
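
The stop-word check itself is just a case-insensitive membership test against a list of common words. The sketch below uses a tiny hand-picked set (`SAMPLE_STOP_WORDS`, a hypothetical stand-in and not NLTK's actual list) so that it runs without downloading anything; in practice you would pass in the list returned by `nltk.corpus.stopwords.words('english')` instead. The helper name `count_stop_words` is also made up for this example.

```python
# Illustrative stand-in for a real stop-word list (an assumption, not NLTK's list).
SAMPLE_STOP_WORDS = {"the", "over", "i", "am", "to", "do", "is"}


def count_stop_words(message, stop_words):
    """Count the words in `message` that appear in `stop_words`, ignoring case."""
    return sum(word.lower() in stop_words for word in message.split())


print(count_stop_words("The quick brown fox jumps over the lazy dog", SAMPLE_STOP_WORDS))
# 3
```

Storing the stop words in a set rather than a list keeps each lookup fast, which matters once the corpus grows beyond a few sentences.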

 

Code:

import pandas as pd
import nltk

nltk.download('stopwords', quiet=True)  # one-time download of the stop-word corpus
stopwords = set(nltk.corpus.stopwords.words('english'))  # a set gives fast membership tests
 
string_lst = [
    'THE quick BROWN fox jumps over the lazy dog',
    'I am TOO LAZY to do ANYTHING',
    'Padhai Time is there to help you out',
]
 
df = pd.DataFrame(string_lst, columns=['msg'])
 
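def derive_features(message):
    """Return basic text features for a single message."""
    words_lst = message.split()

    num_characters = len(message)  # includes spaces
    num_words = len(words_lst)

    # Guard against empty messages, which would otherwise raise
    # ZeroDivisionError (average) or ValueError (max/min of an empty list).
    if num_words == 0:
        return num_characters, 0, 0.0, 0, 0, 0, 0, 0

    words_length = []
    lower_words_lst = []
    upper_words_lst = []
    is_stop_word_lst = []
    for word in words_lst:
        words_length.append(len(word))
        lower_words_lst.append(word.islower())   # True only if the word is all lowercase
        upper_words_lst.append(word.isupper())   # True only if the word is all uppercase
        is_stop_word_lst.append(word.lower() in stopwords)

    # Summing booleans counts the True values.
    stop_words_count = sum(is_stop_word_lst)
    avg_word_length = round(sum(words_length) / len(words_length), 1)
    max_length_word = max(words_length)
    min_length_word = min(words_length)
    total_lower_words = sum(lower_words_lst)
    total_upper_words = sum(upper_words_lst)

    return (num_characters, num_words, avg_word_length, total_lower_words,
            total_upper_words, max_length_word, min_length_word, stop_words_count)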


# Apply the function to every message and unpack the resulting tuples into columns.
(df['num_chars'], df['num_words'], df['avg_word_len'],
 df['num_lower_words'], df['num_upper_words'],
 df['max_length_word'], df['min_length_word'],
 df['num_stop_words']) = zip(*df['msg'].apply(derive_features))
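
Some of the simpler features can also be computed without a custom function, using pandas' vectorized string methods. The sketch below rebuilds the same three messages so that it runs on its own; the column names match the ones used above, and the values should agree with what `derive_features` produces for these two columns.

```python
import pandas as pd

df = pd.DataFrame({'msg': [
    'THE quick BROWN fox jumps over the lazy dog',
    'I am TOO LAZY to do ANYTHING',
    'Padhai Time is there to help you out',
]})

# Character count per message (spaces included).
df['num_chars'] = df['msg'].str.len()

# Split each message on whitespace, then count the resulting words.
df['num_words'] = df['msg'].str.split().str.len()

print(df[['num_chars', 'num_words']])
```

The custom function remains the more flexible option for features such as the stop-word count, where each word has to be checked individually.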

