Natural Language Processing (NLP) is a fundamental area of Artificial Intelligence that focuses on enabling computers to understand, interpret, and generate human language. Before diving into complex NLP concepts, it’s essential to understand some foundational terms and one of the most important preprocessing steps — tokenization.
This tutorial introduces basic terminologies used throughout NLP and then explains what tokenization is, why it is needed, and how it forms the basis of text preprocessing.
1. Key NLP Terminologies
When working with text data, you’ll frequently encounter terms such as corpus, document, vocabulary, and tokens. Let’s clearly define each of them with examples.
1.1 Corpus
A corpus (plural: corpora) refers to a collection of textual data. It can be a single paragraph, a collection of documents, or even an entire dataset used for analysis.
Example:
If you have a paragraph like:
“My name is Arjun. I have an interest in teaching machine learning and NLP.”
Then this entire paragraph can be referred to as a corpus.
In practical NLP applications, corpora may include large collections such as thousands of articles, emails, or tweets.
1.2 Document
A document is a single piece of text within a corpus. It could be a single sentence, a paragraph, or even an entire article, depending on the context.
Example:
In the above corpus, each sentence such as
“My name is Arjun.”
can be treated as one document.
So, a corpus may contain multiple documents.
1.3 Vocabulary
A vocabulary represents the set of all unique words present in a corpus.
It’s similar to a dictionary containing every distinct word that appears at least once in the text.
Example:
Consider these two sentences:
- I like to drink apple juice.
- My friend likes mango juice.
All words combined: [I, like, to, drink, apple, juice, My, friend, likes, mango, juice]
Now, if we count only the unique words, we get: [I, like, to, drink, apple, juice, My, friend, likes, mango]
Vocabulary size = 10 (ten unique words).
If the word likes were replaced by like, the vocabulary size would drop to 9, because like would then appear twice but be counted only once.
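As a quick check, here is a minimal Python sketch that reproduces these counts. It uses a naive whitespace split and strips the trailing period by hand; a real pipeline would use a proper tokenizer (see Section 6).

```python
# Minimal sketch: counting tokens and building a vocabulary.
# Naive whitespace split; trailing periods are stripped by hand.
sentences = [
    "I like to drink apple juice.",
    "My friend likes mango juice.",
]

tokens = [word.strip(".") for sentence in sentences for word in sentence.split()]
vocabulary = set(tokens)  # unique words only

print("Tokens:", tokens)                     # 11 tokens in total
print("Vocabulary size:", len(vocabulary))   # 10 unique words
```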
1.4 Words (Tokens)
Every individual element that appears in a corpus — such as I, like, juice — is referred to as a word or token.
Tokens are the building blocks of NLP.
2. What is Tokenization?
Tokenization is the process of breaking a larger piece of text (a paragraph or sentence) into smaller units called tokens.
These tokens could be sentences, words, or even subwords, depending on the level of granularity.
2.1 Why Tokenization is Important
Tokenization is the first step in text preprocessing.
Since computers cannot understand raw text, each word (token) must be converted into numerical form (vectors or embeddings).
Before that conversion, the text must be properly segmented into meaningful components — and that’s exactly what tokenization does.
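As a rough illustration of that hand-off, the sketch below maps each token to a toy integer ID. Real systems then turn such IDs into dense vectors (embeddings), but the text always has to be segmented into tokens first.

```python
# Toy sketch: mapping tokens to integer IDs, the step before
# vectors/embeddings. The IDs here are arbitrary (sorted order).
tokens = ["I", "have", "an", "interest", "in", "teaching", "NLP"]

token_to_id = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
ids = [token_to_id[token] for token in tokens]

print(token_to_id)
print("Token IDs:", ids)
```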
3. Sentence Tokenization
Sentence tokenization involves splitting a paragraph into individual sentences.
The process typically looks for punctuation marks like periods (.), question marks (?), or exclamation points (!) to detect where one sentence ends and another begins.
Example:
Paragraph:
“My name is Arjun. I have an interest in teaching Machine Learning and NLP. I am also a YouTuber.”
After sentence tokenization, we get:
- My name is Arjun.
- I have an interest in teaching Machine Learning and NLP.
- I am also a YouTuber.
So, from one paragraph (corpus), we extracted three sentences (documents).
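A simple way to mimic this in code is a regular expression that splits on whitespace following end-of-sentence punctuation. This is only a sketch: it breaks on abbreviations such as "Dr." or "e.g.", which is why libraries like NLTK ship trained sentence tokenizers (see Section 6).

```python
import re

# Naive punctuation-based sentence splitting.
paragraph = ("My name is Arjun. I have an interest in teaching "
             "Machine Learning and NLP. I am also a YouTuber.")

# Split wherever whitespace follows ".", "!" or "?".
sentences = re.split(r"(?<=[.!?])\s+", paragraph)
for sentence in sentences:
    print(sentence)
```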
4. Word Tokenization
Once a sentence is extracted, it can be further broken down into individual words.
This process is called word tokenization.
Example:
Sentence:
“I have an interest in teaching NLP.”
After word tokenization: ['I', 'have', 'an', 'interest', 'in', 'teaching', 'NLP']
Each element in this list is a token.
So, depending on the level, tokens can represent sentences, words, or even subwords (in modern NLP models like BERT).
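Even at the word level, a short sketch shows why tokenization is more than splitting on spaces: a naive split keeps punctuation glued to the last word, while a simple regex separates it, which is closer to what library tokenizers produce.

```python
import re

sentence = "I have an interest in teaching NLP."

# Naive whitespace split keeps punctuation attached to the last word.
print(sentence.split())
# ['I', 'have', 'an', 'interest', 'in', 'teaching', 'NLP.']

# A simple regex separates word characters from punctuation marks.
print(re.findall(r"\w+|[^\w\s]", sentence))
# ['I', 'have', 'an', 'interest', 'in', 'teaching', 'NLP', '.']
```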
5. Hierarchy of Text Units
To summarize the hierarchy:
| Level | Description | Example |
|---|---|---|
| Corpus | Entire text dataset | Collection of articles or paragraphs |
| Document | Individual sentence or paragraph | “My name is Arjun.” |
| Sentence Token | Sentence split from a corpus | “I am also a YouTuber.” |
| Word Token | Individual words from a sentence | “Machine”, “Learning”, “NLP” |
| Vocabulary | Set of all unique tokens | {‘I’, ‘Arjun’, ‘teaching’, ‘NLP’, …} |
6. Technical Perspective
When performing NLP tasks in Python, libraries like NLTK and spaCy are commonly used for tokenization.
For example, in Python using NLTK:
```python
from nltk.tokenize import sent_tokenize, word_tokenize

# Note: the tokenizer models must be downloaded once beforehand,
# e.g. nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab').
text = "My name is Arjun. I have an interest in teaching NLP."

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)

# Word Tokenization
words = word_tokenize(text)
print("Words:", words)
```
Output:
```
Sentences: ['My name is Arjun.', 'I have an interest in teaching NLP.']
Words: ['My', 'name', 'is', 'Arjun', '.', 'I', 'have', 'an', 'interest', 'in', 'teaching', 'NLP', '.']
```
This demonstrates how a paragraph can first be split into sentences, and each sentence can then be broken into tokens (words).
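Since spaCy was mentioned above, here is a comparable sketch with it. This assumes spaCy v3+ is installed and uses a blank English pipeline with the rule-based sentencizer component, which is enough for basic sentence and word tokenization.

```python
# Comparable sketch with spaCy (assumes spaCy v3+ is installed).
import spacy

nlp = spacy.blank("en")        # blank English pipeline (tokenizer only)
nlp.add_pipe("sentencizer")    # rule-based sentence boundary detection

doc = nlp("My name is Arjun. I have an interest in teaching NLP.")
print("Sentences:", [sent.text for sent in doc.sents])
print("Words:", [token.text for token in doc])
```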
7. Relationship Between Corpus, Vocabulary, and Tokens
Let’s connect all concepts through an example:
Corpus:
“I like to drink apple juice. My friend likes mango juice.”
Step 1 – Sentence Tokenization:
- Sentence 1: I like to drink apple juice
- Sentence 2: My friend likes mango juice
Step 2 – Word Tokenization:
Tokens = [I, like, to, drink, apple, juice, My, friend, likes, mango, juice]
Total tokens = 11
Step 3 – Vocabulary Creation:
Unique tokens = [I, like, to, drink, apple, juice, My, friend, likes, mango]
Vocabulary size = 10
Thus:
- Corpus = the entire paragraph
- Documents = the sentences
- Tokens = the words
- Vocabulary = unique words present in the corpus
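The whole walkthrough fits in a few lines of Python. The sketch below uses simple regular expressions and keeps only word characters (punctuation is dropped), so the counts line up with the numbers above.

```python
import re

# End-to-end sketch tying the terms together.
corpus = "I like to drink apple juice. My friend likes mango juice."

documents = re.split(r"(?<=[.!?])\s+", corpus)   # sentence tokenization
tokens = re.findall(r"\w+", corpus)              # word tokenization
vocabulary = set(tokens)                         # unique tokens

print("Documents:", documents)
print("Total tokens:", len(tokens))              # 11
print("Vocabulary size:", len(vocabulary))       # 10
```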
8. Summary
- Corpus – A collection of text data.
- Document – A single text unit within a corpus (like a sentence or paragraph).
- Vocabulary – Set of unique words appearing in the corpus.
- Tokenization – The process of splitting text into smaller meaningful pieces (sentences or words).
- Tokens – The output units obtained after tokenization.
Tokenization is a foundational preprocessing step in NLP pipelines.
It prepares text for further stages such as cleaning, stop word removal, stemming, lemmatization, and vectorization.
