Natural Language Processing (NLP) is a fundamental area of Artificial Intelligence that focuses on enabling computers to understand, interpret, and generate human language. Before diving into complex NLP concepts, it’s essential to understand some foundational terms and one of the most important preprocessing steps — tokenization.
This tutorial introduces basic terminologies used throughout NLP and then explains what tokenization is, why it is needed, and how it forms the basis of text preprocessing.
1. Key NLP Terminologies
When working with text data, you’ll frequently encounter terms such as corpus, document, vocabulary, and tokens. Let’s clearly define each of them with examples.
1.1 Corpus
A corpus (plural: corpora) refers to a collection of textual data. It can be a single paragraph, a collection of documents, or even an entire dataset used for analysis.
Example:
If you have a paragraph like:
“My name is Arjun. I have an interest in teaching machine learning and NLP.”
Then this entire paragraph can be referred to as a corpus.
In practical NLP applications, corpora may include large collections such as thousands of articles, emails, or tweets.
1.2 Document
A document is a single piece of text within a corpus. It could be a single sentence, a paragraph, or even an entire article, depending on the context.
Example:
In the above corpus, each sentence such as
“My name is Arjun.”
can be treated as one document.
So, a corpus may contain multiple documents.
1.3 Vocabulary
A vocabulary represents the set of all unique words present in a corpus.
It’s similar to a dictionary containing every distinct word that appears at least once in the text.
Example:
Consider these two sentences:
- I like to drink apple juice.
- My friend likes mango juice.
All words combined: [I, like, to, drink, apple, juice, My, friend, likes, mango, juice]
Now, if we count only the unique words, we get: [I, like, to, drink, apple, juice, My, friend, likes, mango]
Vocabulary size = 10 (ten unique words).
If the word likes were replaced by like, the vocabulary size would drop to 9, because like would then appear twice but be counted only once.
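As a quick check, here is a minimal Python sketch that reproduces these counts. It uses a naive whitespace split and strips the trailing period by hand; a real pipeline would use a proper tokenizer (see Section 6).

```python
# Minimal sketch: counting tokens and building a vocabulary.
# Naive whitespace split; trailing periods are stripped by hand.
sentences = [
    "I like to drink apple juice.",
    "My friend likes mango juice.",
]

tokens = [word.strip(".") for sentence in sentences for word in sentence.split()]
vocabulary = set(tokens)  # unique words only

print("Tokens:", tokens)                     # 11 tokens in total
print("Vocabulary size:", len(vocabulary))   # 10 unique words
```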
1.4 Words (Tokens)
Every individual element that appears in a corpus — such as I, like, juice — is referred to as a word or token.
Tokens are the building blocks of NLP.
2. What is Tokenization?
Tokenization is the process of breaking a larger piece of text (a paragraph or sentence) into smaller units called tokens.
These tokens could be sentences, words, or even subwords, depending on the level of granularity.
2.1 Why Tokenization is Important
Tokenization is the first step in text preprocessing.
Since computers cannot understand raw text, each word (token) must be converted into numerical form (vectors or embeddings).
Before that conversion, the text must be properly segmented into meaningful components — and that’s exactly what tokenization does.
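As a rough illustration of that hand-off, the sketch below maps each token to a toy integer ID. Real systems then turn such IDs into dense vectors (embeddings), but the text always has to be segmented into tokens first.

```python
# Toy sketch: mapping tokens to integer IDs, the step before
# vectors/embeddings. The IDs here are arbitrary (sorted order).
tokens = ["I", "have", "an", "interest", "in", "teaching", "NLP"]

token_to_id = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
ids = [token_to_id[token] for token in tokens]

print(token_to_id)
print("Token IDs:", ids)
```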
3. Sentence Tokenization
Sentence tokenization involves splitting a paragraph into individual sentences.
The process typically looks for punctuation marks like periods (.), question marks (?), or exclamation points (!) to detect where one sentence ends and another begins.
Example:
Paragraph:
“My name is Arjun. I have an interest in teaching Machine Learning and NLP. I am also a YouTuber.”
After sentence tokenization, we get:
- My name is Arjun.
- I have an interest in teaching Machine Learning and NLP.
- I am also a YouTuber.
So, from one paragraph (corpus), we extracted three sentences (documents).
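A simple way to mimic this in code is a regular expression that splits on whitespace following end-of-sentence punctuation. This is only a sketch: it breaks on abbreviations such as "Dr." or "e.g.", which is why libraries like NLTK ship trained sentence tokenizers (see Section 6).

```python
import re

# Naive punctuation-based sentence splitting.
paragraph = ("My name is Arjun. I have an interest in teaching "
             "Machine Learning and NLP. I am also a YouTuber.")

# Split wherever whitespace follows ".", "!" or "?".
sentences = re.split(r"(?<=[.!?])\s+", paragraph)
for sentence in sentences:
    print(sentence)
```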
4. Word Tokenization
Once a sentence is extracted, it can be further broken down into individual words.
This process is called word tokenization.
Example:
Sentence:
“I have an interest in teaching NLP.”
After word tokenization: ['I', 'have', 'an', 'interest', 'in', 'teaching', 'NLP']
Each element in this list is a token.
So, depending on the level, tokens can represent sentences, words, or even subwords (in modern NLP models like BERT).
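Even at the word level, a short sketch shows why tokenization is more than splitting on spaces: a naive split keeps punctuation glued to the last word, while a simple regex separates it, which is closer to what library tokenizers produce.

```python
import re

sentence = "I have an interest in teaching NLP."

# Naive whitespace split keeps punctuation attached to the last word.
print(sentence.split())
# ['I', 'have', 'an', 'interest', 'in', 'teaching', 'NLP.']

# A simple regex separates word characters from punctuation marks.
print(re.findall(r"\w+|[^\w\s]", sentence))
# ['I', 'have', 'an', 'interest', 'in', 'teaching', 'NLP', '.']
```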
5. Hierarchy of Text Units
To summarize the hierarchy:
| Level | Description | Example |
|---|---|---|
| Corpus | Entire text dataset | Collection of articles or paragraphs |
| Document | Individual sentence or paragraph | “My name is Arjun.” |
| Sentence Token | Sentence split from a corpus | “I am also a YouTuber.” |
| Word Token | Individual words from a sentence | “Machine”, “Learning”, “NLP” |
| Vocabulary | Set of all unique tokens | {‘I’, ‘Arjun’, ‘teaching’, ‘NLP’, …} |
6. Technical Perspective
When performing NLP tasks in Python, libraries like NLTK and spaCy are commonly used for tokenization.
For example, in Python using NLTK:
```python
from nltk.tokenize import sent_tokenize, word_tokenize

# Note: the tokenizer models must be downloaded once beforehand,
# e.g. nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab').
text = "My name is Arjun. I have an interest in teaching NLP."

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)

# Word Tokenization
words = word_tokenize(text)
print("Words:", words)
```
Output:
```
Sentences: ['My name is Arjun.', 'I have an interest in teaching NLP.']
Words: ['My', 'name', 'is', 'Arjun', '.', 'I', 'have', 'an', 'interest', 'in', 'teaching', 'NLP', '.']
```
This demonstrates how a paragraph can first be split into sentences, and each sentence can then be broken into tokens (words).
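Since spaCy was mentioned above, here is a comparable sketch with it. This assumes spaCy v3+ is installed and uses a blank English pipeline with the rule-based sentencizer component, which is enough for basic sentence and word tokenization.

```python
# Comparable sketch with spaCy (assumes spaCy v3+ is installed).
import spacy

nlp = spacy.blank("en")        # blank English pipeline (tokenizer only)
nlp.add_pipe("sentencizer")    # rule-based sentence boundary detection

doc = nlp("My name is Arjun. I have an interest in teaching NLP.")
print("Sentences:", [sent.text for sent in doc.sents])
print("Words:", [token.text for token in doc])
```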
7. Relationship Between Corpus, Vocabulary, and Tokens
Let’s connect all concepts through an example:
Corpus:
“I like to drink apple juice. My friend likes mango juice.”
Step 1 – Sentence Tokenization:
- Sentence 1: I like to drink apple juice
- Sentence 2: My friend likes mango juice
Step 2 – Word Tokenization:
Tokens = [I, like, to, drink, apple, juice, My, friend, likes, mango, juice]
Total tokens = 11
Step 3 – Vocabulary Creation:
Unique tokens = [I, like, to, drink, apple, juice, My, friend, likes, mango]
Vocabulary size = 10
Thus:
- Corpus = the entire paragraph
- Documents = the sentences
- Tokens = the words
- Vocabulary = unique words present in the corpus
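The whole walkthrough fits in a few lines of Python. The sketch below uses simple regular expressions and keeps only word characters (punctuation is dropped), so the counts line up with the numbers above.

```python
import re

# End-to-end sketch tying the terms together.
corpus = "I like to drink apple juice. My friend likes mango juice."

documents = re.split(r"(?<=[.!?])\s+", corpus)   # sentence tokenization
tokens = re.findall(r"\w+", corpus)              # word tokenization
vocabulary = set(tokens)                         # unique tokens

print("Documents:", documents)
print("Total tokens:", len(tokens))              # 11
print("Vocabulary size:", len(vocabulary))       # 10
```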
8. Summary
- Corpus – A collection of text data.
- Document – A single text unit within a corpus (like a sentence or paragraph).
- Vocabulary – Set of unique words appearing in the corpus.
- Tokenization – The process of splitting text into smaller meaningful pieces (sentences or words).
- Tokens – The output units obtained after tokenization.
Tokenization is a foundational preprocessing step in NLP pipelines.
It prepares text for further stages such as cleaning, stop word removal, stemming, lemmatization, and vectorization.
