NLTK provides multiple tokenizers that behave differently when handling punctuation and special characters.
Let's look at two alternatives to `word_tokenize()`:
1. WordPunctTokenizer
The WordPunctTokenizer splits text more aggressively based on punctuation.
```python
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokens = tokenizer.tokenize(corpus)
print(tokens)
```

Output:

```python
['Hello', '!', 'Welcome', 'to', 'Arjun', "'", 's', 'NLP', 'tutorials', '.', 'Please', 'do', 'watch', 'the', 'entire', 'course', 'to', 'become', 'an', 'expert', 'in', 'NLP', '.']
```
Notice that "Arjun's" has been split into ["Arjun", "'", "s"], unlike word_tokenize(), which kept "'s" together.
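This aggressive behavior is easy to see from the tokenizer's underlying pattern: per the NLTK documentation, `WordPunctTokenizer` is a regexp tokenizer built on `\w+|[^\w\s]+`, i.e., runs of word characters or runs of punctuation. A minimal sketch with the standard library reproduces it:

```python
import re

# WordPunctTokenizer uses the regexp r"\w+|[^\w\s]+": a token is either a
# run of word characters or a run of punctuation. The apostrophe in
# "Arjun's" is punctuation, so it becomes its own token.
text = "Hello! Welcome to Arjun's NLP tutorials."
tokens = re.findall(r"\w+|[^\w\s]+", text)
print(tokens)
# ['Hello', '!', 'Welcome', 'to', 'Arjun', "'", 's', 'NLP', 'tutorials', '.']
```

Because it is a single regular expression, this tokenizer is fast and fully predictable, but it has no notion of contractions or abbreviations.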
2. TreebankWordTokenizer
TreebankWordTokenizer follows conventions used in the Penn Treebank corpus. It handles punctuation more precisely.
```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(corpus)
print(tokens)
```

Output:

```python
['Hello', '!', 'Welcome', 'to', 'Arjun', "'s", 'NLP', 'tutorials.', 'Please', 'do', 'watch', 'the', 'entire', 'course', 'to', 'become', 'an', 'expert', 'in', 'NLP', '.']
```
Here:

- Internal full stops stay attached to the preceding word (e.g., `tutenials.` is produced as `tutorials.` instead of `tutorials`, `.`), though a period at the very end of the string is still split off as its own token.
- The possessive `'s` remains attached as a single token, which is often desirable in linguistic parsing.
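The period behavior is worth seeing in isolation. A minimal check (assumes NLTK is installed; no corpus download is needed, since this tokenizer is rule-based):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# A period followed by more text stays attached ("tutorials."), while the
# period at the very end of the string is split into its own token.
print(tokenizer.tokenize("NLP tutorials. Watch the course."))
# ['NLP', 'tutorials.', 'Watch', 'the', 'course', '.']
```

This is because the Treebank conventions assume input has already been split into sentences, so only a string-final period is treated as sentence-ending punctuation.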
Comparison Summary
| Tokenizer | Splits on punctuation | Treats 's as separate | Keeps . attached to last word |
|---|---|---|---|
| `word_tokenize()` | Yes | No | No |
| `WordPunctTokenizer()` | Yes (aggressively) | Yes | No |
| `TreebankWordTokenizer()` | Partially | No | Yes (internal; a string-final `.` is split) |
When to Use Which Tokenizer
- `word_tokenize()` – General-purpose tokenization; good for most NLP preprocessing.
- `WordPunctTokenizer()` – Use when punctuation needs to be treated as separate tokens.
- `TreebankWordTokenizer()` – Use for linguistic or syntactic analysis tasks where punctuation should stay attached.
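To see the differences side by side, the sketch below runs the two rule-based tokenizers on the same sentence (assumes NLTK is installed; `word_tokenize()` is omitted here because it additionally requires downloading the `punkt` sentence model):

```python
from nltk.tokenize import WordPunctTokenizer, TreebankWordTokenizer

sentence = "Arjun's course covers NLP."

# WordPunctTokenizer: the apostrophe becomes a separate token.
print(WordPunctTokenizer().tokenize(sentence))
# ['Arjun', "'", 's', 'course', 'covers', 'NLP', '.']

# TreebankWordTokenizer: "'s" stays a single token; the string-final
# period is still split off.
print(TreebankWordTokenizer().tokenize(sentence))
# ['Arjun', "'s", 'course', 'covers', 'NLP', '.']
```

If your downstream task needs possessives intact (POS tagging, parsing), the Treebank output is usually the better fit; if you need every punctuation mark as its own token (e.g., simple frequency counts), `WordPunctTokenizer` is the simpler choice.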
