NLTK provides multiple tokenizers that behave differently when handling punctuation and special characters.
Let's look at two alternatives to `word_tokenize()`:
1. WordPunctTokenizer
The WordPunctTokenizer splits text more aggressively based on punctuation.
```python
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokens = tokenizer.tokenize(corpus)
print(tokens)
```

Output:

```python
['Hello', '!', 'Welcome', 'to', 'Arjun', "'", 's', 'NLP', 'tutorials', '.', 'Please', 'do', 'watch', 'the', 'entire', 'course', 'to', 'become', 'an', 'expert', 'in', 'NLP', '.']
```
Notice that "Arjun's" has been split into ["Arjun", "'", "s"], unlike word_tokenize(), which kept "'s" together.
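This aggressive behavior is easy to see from the tokenizer's underlying pattern: per the NLTK documentation, `WordPunctTokenizer` is a regexp tokenizer built on `\w+|[^\w\s]+`, i.e., runs of word characters or runs of punctuation. A minimal sketch with the standard library reproduces it:

```python
import re

# WordPunctTokenizer uses the regexp r"\w+|[^\w\s]+": a token is either a
# run of word characters or a run of punctuation. The apostrophe in
# "Arjun's" is punctuation, so it becomes its own token.
text = "Hello! Welcome to Arjun's NLP tutorials."
tokens = re.findall(r"\w+|[^\w\s]+", text)
print(tokens)
# ['Hello', '!', 'Welcome', 'to', 'Arjun', "'", 's', 'NLP', 'tutorials', '.']
```

Because it is a single regular expression, this tokenizer is fast and fully predictable, but it has no notion of contractions or abbreviations.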
2. TreebankWordTokenizer
TreebankWordTokenizer follows conventions used in the Penn Treebank corpus. It handles punctuation more precisely.
```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(corpus)
print(tokens)
```

Output:

```python
['Hello', '!', 'Welcome', 'to', 'Arjun', "'s", 'NLP', 'tutorials.', 'Please', 'do', 'watch', 'the', 'entire', 'course', 'to', 'become', 'an', 'expert', 'in', 'NLP', '.']
```
Here:

- Internal full stops stay attached to the preceding word (e.g., `tutenials.` is produced as `tutorials.` instead of `tutorials`, `.`), though a period at the very end of the string is still split off as its own token.
- The possessive `'s` remains attached as a single token, which is often desirable in linguistic parsing.
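The period behavior is worth seeing in isolation. A minimal check (assumes NLTK is installed; no corpus download is needed, since this tokenizer is rule-based):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# A period followed by more text stays attached ("tutorials."), while the
# period at the very end of the string is split into its own token.
print(tokenizer.tokenize("NLP tutorials. Watch the course."))
# ['NLP', 'tutorials.', 'Watch', 'the', 'course', '.']
```

This is because the Treebank conventions assume input has already been split into sentences, so only a string-final period is treated as sentence-ending punctuation.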
Comparison Summary
| Tokenizer | Splits on punctuation | Treats 's as separate | Keeps . attached to last word |
|---|---|---|---|
| `word_tokenize()` | Yes | No | No |
| `WordPunctTokenizer()` | Yes (aggressively) | Yes | No |
| `TreebankWordTokenizer()` | Partially | No | Yes (internal; a string-final `.` is split) |
When to Use Which Tokenizer
- `word_tokenize()` – General-purpose tokenization; good for most NLP preprocessing.
- `WordPunctTokenizer()` – Use when punctuation needs to be treated as separate tokens.
- `TreebankWordTokenizer()` – Use for linguistic or syntactic analysis tasks where punctuation should stay attached.
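To see the differences side by side, the sketch below runs the two rule-based tokenizers on the same sentence (assumes NLTK is installed; `word_tokenize()` is omitted here because it additionally requires downloading the `punkt` sentence model):

```python
from nltk.tokenize import WordPunctTokenizer, TreebankWordTokenizer

sentence = "Arjun's course covers NLP."

# WordPunctTokenizer: the apostrophe becomes a separate token.
print(WordPunctTokenizer().tokenize(sentence))
# ['Arjun', "'", 's', 'course', 'covers', 'NLP', '.']

# TreebankWordTokenizer: "'s" stays a single token; the string-final
# period is still split off.
print(TreebankWordTokenizer().tokenize(sentence))
# ['Arjun', "'s", 'course', 'covers', 'NLP', '.']
```

If your downstream task needs possessives intact (POS tagging, parsing), the Treebank output is usually the better fit; if you need every punctuation mark as its own token (e.g., simple frequency counts), `WordPunctTokenizer` is the simpler choice.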
