Padhai Time

Tokenization

Tokenization is the process of splitting text (documents, phrases, or sentences) into smaller units. Each resulting unit is called a token.

Examples:

Splitting a text into sentences (each sentence is a token)
Splitting a sentence into words (each word is a token)

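The nltk library provides several tokenizers in its nltk.tokenize module, and we import the one that matches the unit we need. The sentence and word tokenizers rely on NLTK's Punkt data, which may need a one-time download with nltk.download('punkt') (recent NLTK releases may ask for 'punkt_tab' instead).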

1) Sentence Tokenizer:

The text is split into sentences.

Code:

from nltk.tokenize import sent_tokenize

text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'

sent_tokenize(text_data)

Output:

['The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31.',

'Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.']
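To get a feel for what a sentence tokenizer does, here is a minimal sketch that uses only Python's re module. It is not how sent_tokenize works internally (NLTK relies on a trained Punkt model). The helper name naive_sent_split and the rule "split after '.', '!' or '?' when whitespace follows" are assumptions made purely for illustration.

```python
import re

def naive_sent_split(text):
    """Illustrative only: split after '.', '!' or '?' when whitespace follows."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

text_data = ('The market extended gains for the seventh consecutive session, '
             'climbing 1 percent to end at record closing high on May 31. '
             'Reliance Industries continued to be a leader in the rally, '
             'followed by private banks & financials and FMCG stocks.')

sentences = naive_sent_split(text_data)
for sentence in sentences:
    print(sentence)
```

A rule this simple would also split after an abbreviation such as "Dr." in "Dr. Rao", which is the kind of case a trained sentence tokenizer is designed to handle better.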

2) Word Tokenizer:

The text is split into words, and punctuation marks become separate tokens.

Code:

from nltk.tokenize import word_tokenize

text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'

word_tokenize(text_data)

Output:

['The', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session', ',', 'climbing', '1', 'percent', 'to', 'end', 'at', 'record', 'closing', 'high', 'on', 'May', '31', '.', 'Reliance', 'Industries', 'continued', 'to', 'be', 'a', 'leader', 'in', 'the', 'rally', ',', 'followed', 'by', 'private', 'banks', '&', 'financials', 'and', 'FMCG', 'stocks', '.']
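The idea of treating punctuation as its own token can be approximated with a single regular expression. The sketch below is an assumption-laden stand-in, not NLTK's algorithm: word_tokenize applies more elaborate rules (for example, for contractions), and the helper name naive_word_split is hypothetical.

```python
import re

def naive_word_split(text):
    """Illustrative only: runs of word characters, or single punctuation marks."""
    # \w+ matches words and numbers; [^\w\s] matches one non-space symbol.
    return re.findall(r'\w+|[^\w\s]', text)

sentence = ('Reliance Industries continued to be a leader in the rally, '
            'followed by private banks & financials and FMCG stocks.')

tokens = naive_word_split(sentence)
print(tokens)
```

Here ',' and '.' come out as separate tokens, and '&' stays a token of its own because it is already surrounded by spaces.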

3) Whitespace Tokenizer:

The text is split on whitespace only. In the previous example, "," and "." were not kept as part of a word: they have their own usage and meaning, so they were split off and treated as separate tokens. With WhitespaceTokenizer, characters that occur together with no space between them stay together, so punctuation remains attached to the neighbouring word.

Code:

from nltk.tokenize import WhitespaceTokenizer

text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'

print(WhitespaceTokenizer().tokenize(text_data))

Output:

['The', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session,', 'climbing', '1', 'percent', 'to', 'end', 'at', 'record', 'closing', 'high', 'on', 'May', '31.', 'Reliance', 'Industries', 'continued', 'to', 'be', 'a', 'leader', 'in', 'the', 'rally,', 'followed', 'by', 'private', 'banks', '&', 'financials', 'and', 'FMCG', 'stocks.']
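For plain text like this, Python's built-in str.split() with no arguments also splits on runs of whitespace, so it can serve as a quick comparison point. This is a sketch for intuition rather than a statement about NLTK's internals; the two approaches should agree on this example, but that is an assumption to verify on your own data.

```python
sentence = ('Reliance Industries continued to be a leader in the rally, '
            'followed by private banks & financials and FMCG stocks.')

# split() with no arguments treats any run of whitespace as one separator.
tokens = sentence.split()
print(tokens)

# Punctuation stays glued to the word it touches.
attached = [token for token in tokens if token[-1] in ',.']
print(attached)
```

Because the comma and the full stop are never separated, 'rally,' and 'rally' would be counted as different tokens, which matters when building a vocabulary.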
