Tokenization is the process of splitting text (documents, sentences, phrases) into smaller units called tokens.
The nltk library provides several tokenizers; import the one that matches the unit you want to split on.
Examples:
1) Sentence Tokenizer:
The text is split into sentences.
Code:
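# sent_tokenize needs NLTK's Punkt data; if it is missing, run nltk.download('punkt') once ('punkt_tab' on newer NLTK releases)
from nltk.tokenize import sent_tokenize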
text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
sent_tokenize(text_data)
Output:
['The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31.',
'Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.']
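To make the idea of sentence splitting concrete, here is a minimal sketch that uses only Python's standard library. It is not how sent_tokenize works internally (NLTK uses a trained Punkt model); the helper name naive_sent_split and its rule, "split at whitespace that follows '.', '!' or '?'", are assumptions made for illustration.

```python
import re

def naive_sent_split(text):
    """Split text wherever whitespace follows '.', '!' or '?' (illustrative rule only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text_data = ("The market extended gains for the seventh consecutive session, "
             "climbing 1 percent to end at record closing high on May 31. "
             "Reliance Industries continued to be a leader in the rally, "
             "followed by private banks & financials and FMCG stocks.")

sentences = naive_sent_split(text_data)
print(len(sentences))  # 2

# Limitation: the same rule also breaks after abbreviations.
print(naive_sent_split("Dr. Rao spoke first. The session then closed."))
# ['Dr.', 'Rao spoke first.', 'The session then closed.']
```

The second call shows why a simple rule is usually not enough: the abbreviation "Dr." is treated as the end of a sentence. Handling such cases is the main reason to prefer a trained sentence tokenizer over a hand-written split.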
2) Word Tokenizer:
The text is split into words; punctuation marks become separate tokens.
Code:
from nltk.tokenize import word_tokenize
text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
word_tokenize(text_data)
Output:
['The', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session', ',', 'climbing', '1', 'percent', 'to', 'end', 'at', 'record', 'closing', 'high', 'on', 'May', '31', '.', 'Reliance', 'Industries', 'continued', 'to', 'be', 'a', 'leader', 'in', 'the', 'rally', ',', 'followed', 'by', 'private', 'banks', '&', 'financials', 'and', 'FMCG', 'stocks', '.']
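The behaviour in this output, where punctuation is separated from words, can be approximated with a short regular expression. This is only a sketch under stated assumptions: simple_word_tokenize is a hypothetical helper, not part of nltk, and word_tokenize applies more elaborate rules (for example, it splits contractions such as "don't" into "do" and "n't", which this pattern does not).

```python
import re

def simple_word_tokenize(text):
    # \w+      : a run of letters, digits or underscores (a "word")
    # [^\w\s]  : any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = ("Reliance Industries continued to be a leader in the rally, "
          "followed by private banks & financials and FMCG stocks.")

tokens = simple_word_tokenize(sample)
print(tokens)
# ['Reliance', 'Industries', 'continued', 'to', 'be', 'a', 'leader', 'in',
#  'the', 'rally', ',', 'followed', 'by', 'private', 'banks', '&',
#  'financials', 'and', 'FMCG', 'stocks', '.']
```

For this sample sentence the pattern gives the same tokens as the output above, but that agreement should not be assumed for text with contractions, hyphenated words, or decimal numbers.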
3) WhitespaceTokenizer:
The text is split on whitespace only. In the previous example, “,” and “.” were split off as separate tokens, because they have their own usage and meaning and are not part of the word. WhitespaceTokenizer instead keeps characters that occur together as one token, so punctuation stays attached to the neighbouring word (for example 'session,' and '31.').
Code:
from nltk.tokenize import WhitespaceTokenizer
text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
print(WhitespaceTokenizer().tokenize(text_data))
Output:
['The', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session,', 'climbing', '1', 'percent', 'to', 'end', 'at', 'record', 'closing', 'high', 'on', 'May', '31.', 'Reliance', 'Industries', 'continued', 'to', 'be', 'a', 'leader', 'in', 'the', 'rally,', 'followed', 'by', 'private', 'banks', '&', 'financials', 'and', 'FMCG', 'stocks.']
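A quick way to see which tokens keep their punctuation after a whitespace split is sketched below. It uses Python's built-in str.split(), which splits on runs of whitespace and is assumed here to behave like WhitespaceTokenizer for ordinary text such as this sample; the variable names are illustrative.

```python
import string

sample = ("The market extended gains for the seventh consecutive session, "
          "climbing 1 percent to end at record closing high on May 31.")

# Split on whitespace only; punctuation stays attached to the word.
tokens = sample.split()

# Tokens that still end with a punctuation character.
attached = [t for t in tokens if t[-1] in string.punctuation]
print(attached)  # ['session,', '31.']

# Optional clean-up: strip leading/trailing punctuation and drop empty results.
stripped = [s for s in (t.strip(string.punctuation) for t in tokens) if s]
print(stripped[8], stripped[-1])  # session 31
```

Stripping punctuation afterwards discards it entirely, whereas a word tokenizer keeps it as separate tokens. Which behaviour is preferable depends on whether punctuation matters for the task at hand.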