Learnitweb

Using Tokenization in NLTK

NLTK provides multiple tokenizers that behave differently when handling punctuation and special characters.

Let’s look at two alternatives to word_tokenize():

1. WordPunctTokenizer

The WordPunctTokenizer splits text aggressively: every run of alphanumeric characters and every run of punctuation becomes its own token (it is a RegexpTokenizer with the pattern \w+|[^\w\s]+).

from nltk.tokenize import WordPunctTokenizer

# Sample text for this tutorial (the same string tokenized in the outputs below)
corpus = ("Hello! Welcome to Arjun's NLP tutorials. "
          "Please do watch the entire course to become an expert in NLP.")

tokenizer = WordPunctTokenizer()
tokens = tokenizer.tokenize(corpus)
print(tokens)

Output:

['Hello', '!', 'Welcome', 'to', 'Arjun', "'", 's', 'NLP', 'tutorials', '.', 
 'Please', 'do', 'watch', 'the', 'entire', 'course', 'to', 'become', 'an', 
 'expert', 'in', 'NLP', '.']

Notice that "Arjun's" has been split into ["Arjun", "'", "s"], unlike word_tokenize(), which kept "'s" together.

2. TreebankWordTokenizer

TreebankWordTokenizer follows the tokenization conventions of the Penn Treebank corpus. It splits most punctuation into separate tokens, but keeps possessives such as 's as single tokens and leaves full stops attached to the preceding word when they occur mid-text.

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(corpus)
print(tokens)

Output:

['Hello', '!', 'Welcome', 'to', 'Arjun', "'s", 'NLP', 'tutorials.', 
 'Please', 'do', 'watch', 'the', 'entire', 'course', 'to', 'become', 'an', 
 'expert', 'in', 'NLP', '.']

Here:

  • Full stops inside the text stay attached to the preceding word (tutorials. rather than tutorials, .); only a full stop at the very end of the string is split off, hence the final 'NLP', '.'.
  • The possessive 's is kept as a single token ("'s") rather than being split into "'" and "s", which is often desirable in linguistic parsing.
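The possessive handling is easy to check on a short example (a sketch; any possessive phrase works):

```python
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer

phrase = "Arjun's NLP"
print(TreebankWordTokenizer().tokenize(phrase))  # ['Arjun', "'s", 'NLP']
print(WordPunctTokenizer().tokenize(phrase))     # ['Arjun', "'", 's', 'NLP']
```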

Comparison Summary

Tokenizer                  Splits on punctuation   Splits 's into ' + s   Keeps mid-text . attached
word_tokenize()            Yes                     No                     No
WordPunctTokenizer()       Yes (aggressively)      Yes                    No
TreebankWordTokenizer()    Partially               No                     Yes
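The rows above can be reproduced by running the tokenizers side by side on the same string (a sketch; the sample text matches the outputs shown earlier):

```python
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer

corpus = ("Hello! Welcome to Arjun's NLP tutorials. "
          "Please do watch the entire course to become an expert in NLP.")

# Compare how each tokenizer handles the same punctuation and possessive
for tok in (WordPunctTokenizer(), TreebankWordTokenizer()):
    print(f"{type(tok).__name__}: {tok.tokenize(corpus)}")
```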

When to Use Which Tokenizer

  • word_tokenize() – General-purpose tokenization; good for most NLP preprocessing.
  • WordPunctTokenizer() – Use when punctuation needs to be treated as separate tokens.
  • TreebankWordTokenizer() – Use when working with linguistic or syntactic analysis tasks where punctuation should stay attached.
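Note that word_tokenize() itself first splits the text into sentences and then applies a Treebank-style word tokenizer, which is why it detaches every sentence-final full stop. If your text already has one sentence per string, TreebankWordTokenizer alone behaves comparably. A minimal sketch (the sentence list is illustrative):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# One sentence per string, as a sentence splitter would produce
sentences = ["Hello!", "Welcome to Arjun's NLP tutorials."]
tokens = [t for sent in sentences for t in tokenizer.tokenize(sent)]
print(tokens)  # ['Hello', '!', 'Welcome', 'to', 'Arjun', "'s", 'NLP', 'tutorials', '.']
```

Because each string now ends with its sentence-final punctuation, the tokenizer's end-of-string rule splits off the full stop, matching word_tokenize()'s behavior.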