1. Introduction to Named Entity Recognition (NER)
Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP) that identifies specific types of entities in text and classifies them into predefined categories such as:
- PERSON – Names of people (e.g., Amitabh Bachchan, Albert Einstein)
- ORGANIZATION – Institutions or companies (e.g., Google, Indian Railways)
- LOCATION – Cities, countries, or landmarks (e.g., Mumbai, Himalayas, USA)
- DATE – Specific dates or periods (e.g., 1st November 2025)
- MONEY – Monetary amounts (e.g., ₹50,000, $1000)
- TIME – Time expressions (e.g., 5 PM, two hours ago)
For example:
Text: “Ratan Tata is the former chairman of Tata Sons, headquartered in Mumbai.”
Entities:
- Ratan Tata → PERSON
- Tata Sons → ORGANIZATION
- Mumbai → LOCATION
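To show what "structured information" could look like in practice, here is a minimal sketch that stores the entities above as records with character offsets. The Entity type and the locate helper are hypothetical names introduced for illustration, and the labels are supplied by hand rather than predicted by a model.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    label: str
    start: int  # character offset where the entity begins
    end: int    # offset one past the entity's last character


def locate(text, mentions):
    """Attach character offsets to hand-labelled (mention, label) pairs."""
    entities = []
    for mention, label in mentions:
        start = text.find(mention)  # first occurrence only; -1 if absent
        if start != -1:
            entities.append(Entity(mention, label, start, start + len(mention)))
    return entities


text = "Ratan Tata is the former chairman of Tata Sons, headquartered in Mumbai."
mentions = [
    ("Ratan Tata", "PERSON"),
    ("Tata Sons", "ORGANIZATION"),
    ("Mumbai", "LOCATION"),
]
entities = locate(text, mentions)
for ent in entities:
    print(ent)
```

Storing offsets alongside the label makes it possible to highlight or link each entity back to its position in the original text.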
2. Why NER Is Important
Most real-world data—news, reports, and emails—is unstructured. NER transforms this text into structured information by identifying key entities.
Key use cases:
- Information Extraction – Identify important entities like names or locations in documents.
- Content Categorization – Tag news articles by organization or person.
- Search Optimization – Improve search relevance using entity recognition.
- Chatbots – Extract details such as locations or dates from user messages.
- Finance and Business – Identify company names, stock tickers, and transaction values.
- Healthcare – Detect diseases, drugs, or medical conditions from clinical text.
3. How NER Works in NLTK
The NLTK (Natural Language Toolkit) library provides basic NER functionality using a pre-trained classifier.
Its built-in NER chunker (ne_chunk) relies on a maximum entropy classifier and part-of-speech tags. For more advanced setups, NLTK can also interface with the external Stanford Named Entity Recognizer.
The general steps for NER in NLTK are:
- Tokenization – Split text into words.
- POS Tagging – Assign part-of-speech tags (noun, verb, etc.).
- Chunking – Group the tagged tokens into candidate entity spans.
- Named Entity Recognition – Label each span as PERSON, ORGANIZATION, etc. In NLTK, ne_chunk performs this step and the chunking step together.
4. Setting Up NLTK
Step 1: Install and Import NLTK
pip install nltk
Then, in Python:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
These downloads provide:
- punkt – Tokenizer
- averaged_perceptron_tagger – POS tagger
- maxent_ne_chunker – NER chunker
- words – Word corpus for dictionary reference
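The downloads above can be wrapped in a small helper. This is a minimal sketch: the helper name download_ner_resources is hypothetical, and the extra names (punkt_tab, averaged_perceptron_tagger_eng, maxent_ne_chunker_tab) are renamed resources that recent NLTK releases appear to expect, so treat them as an assumption and check them against your installed version.

```python
import nltk

# Names from the tutorial plus the renamed variants (assumption: the
# "_tab" / "_eng" names apply to newer NLTK releases).
RESOURCES = [
    "punkt", "punkt_tab",
    "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",
    "maxent_ne_chunker", "maxent_ne_chunker_tab",
    "words",
]


def download_ner_resources(names=RESOURCES):
    """Try to download each resource; return the names that failed."""
    failed = []
    for name in names:
        # nltk.download returns False when a package cannot be fetched.
        if not nltk.download(name, quiet=True):
            failed.append(name)
    return failed
```

Calling download_ner_resources() once at setup time returns a list of names that could not be fetched; names that do not exist for your NLTK version will simply appear in that list.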
5. Performing Named Entity Recognition in NLTK
Let’s walk through the steps with a simple example.
Step 1: Input Text
text = "Sundar Pichai is the CEO of Google and lives in California."
Step 2: Tokenization
from nltk import word_tokenize
tokens = word_tokenize(text)
print(tokens)
Output:
['Sundar', 'Pichai', 'is', 'the', 'CEO', 'of', 'Google', 'and', 'lives', 'in', 'California', '.']
Step 3: Part-of-Speech (POS) Tagging
from nltk import pos_tag
pos_tags = pos_tag(tokens)
print(pos_tags)
Output:
[('Sundar', 'NNP'), ('Pichai', 'NNP'), ('is', 'VBZ'), ('the', 'DT'),
('CEO', 'NNP'), ('of', 'IN'), ('Google', 'NNP'), ('and', 'CC'),
('lives', 'VBZ'), ('in', 'IN'), ('California', 'NNP'), ('.', '.')]
Explanation:
- NNP → Proper noun (singular)
- VBZ → Verb (3rd person singular present)
- DT → Determiner
Step 4: Named Entity Chunking
from nltk import ne_chunk
chunks = ne_chunk(pos_tags)
print(chunks)
Output (tree structure):
(S (PERSON Sundar/NNP Pichai/NNP) is/VBZ the/DT CEO/NNP of/IN (ORGANIZATION Google/NNP) and/CC lives/VBZ in/IN (GPE California/NNP) ./.)
Explanation:
- (PERSON Sundar Pichai) → Person name
- (ORGANIZATION Google) → Organization
- (GPE California) → Geopolitical Entity (location)
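The same tree can also be viewed as a flat sequence of IOB tags (B- for the first token of an entity, I- for a continuation, O for tokens outside any entity). The sketch below rebuilds the tree shown above by hand, so it needs no model downloads, and converts it with tree2conlltags from nltk.chunk.

```python
from nltk.tree import Tree
from nltk.chunk import tree2conlltags

# Hand-built copy of the tree printed above (not produced by ne_chunk here).
tree = Tree("S", [
    Tree("PERSON", [("Sundar", "NNP"), ("Pichai", "NNP")]),
    ("is", "VBZ"), ("the", "DT"), ("CEO", "NNP"), ("of", "IN"),
    Tree("ORGANIZATION", [("Google", "NNP")]),
    ("and", "CC"), ("lives", "VBZ"), ("in", "IN"),
    Tree("GPE", [("California", "NNP")]),
    (".", "."),
])

# Each token becomes a (word, POS tag, IOB tag) triple.
iob_tags = tree2conlltags(tree)
for word, pos, iob in iob_tags:
    print(f"{word:<12}{pos:<5}{iob}")
```

The IOB form is convenient when entity labels need to be aligned token by token, for example when comparing predictions against annotated data.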
6. Visualizing the Named Entity Tree
NLTK can visualize the hierarchical entity structure using its tree visualizer.
chunks.draw()
A GUI window opens showing a tree with labeled entities (PERSON, ORGANIZATION, GPE). This requires Tkinter and a graphical display, so it will not work on a headless server.
7. Extracting Named Entities Programmatically
You can programmatically extract all recognized entities using this function:
from nltk.tree import Tree

def extract_named_entities(tree):
    entities = []
    for subtree in tree:
        # Entity chunks are Tree objects; ordinary tokens are (word, tag) tuples.
        if isinstance(subtree, Tree):
            # Join the words of the chunk into a single entity string.
            entity_name = " ".join([token for token, pos in subtree.leaves()])
            entity_type = subtree.label()
            entities.append((entity_name, entity_type))
    return entities

entities = extract_named_entities(chunks)
print(entities)
Output:
[('Sundar Pichai', 'PERSON'), ('Google', 'ORGANIZATION'), ('California', 'GPE')]
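Once entities are available as (name, type) pairs, a common next step is to group them by type. The following is a minimal sketch; group_by_type is a hypothetical helper, and the sample list simply copies the output shown above so the snippet runs on its own.

```python
from collections import defaultdict


def group_by_type(entities):
    """Map each entity type to the distinct names found for it."""
    grouped = defaultdict(list)
    for name, label in entities:
        if name not in grouped[label]:  # skip repeated mentions
            grouped[label].append(name)
    return dict(grouped)


sample_entities = [
    ("Sundar Pichai", "PERSON"),
    ("Google", "ORGANIZATION"),
    ("California", "GPE"),
]
grouped = group_by_type(sample_entities)
print(grouped)
```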
8. NER Example with Multiple Sentences
paragraph = """Elon Musk founded SpaceX in 2002.
He also leads Tesla Motors, headquartered in California."""
sentences = nltk.sent_tokenize(paragraph)
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    pos_tags = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(pos_tags)
    print(extract_named_entities(tree))
Output:
[('Elon Musk', 'PERSON'), ('SpaceX', 'ORGANIZATION')]
[('Tesla Motors', 'ORGANIZATION'), ('California', 'GPE')]
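Per-sentence results like these can be combined into document-level statistics. As a small sketch, the snippet below counts how often each entity type occurs; the per_sentence list is copied from the output above rather than recomputed, so no NLTK models are needed to run it.

```python
from collections import Counter

# Entity lists for each sentence, copied from the output shown above.
per_sentence = [
    [("Elon Musk", "PERSON"), ("SpaceX", "ORGANIZATION")],
    [("Tesla Motors", "ORGANIZATION"), ("California", "GPE")],
]

# Flatten the lists and count occurrences of each entity type.
label_counts = Counter(
    label for sentence_entities in per_sentence for _, label in sentence_entities
)
print(label_counts.most_common())
```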
9. How NLTK’s NER Works Internally
NLTK’s built-in NER is based on a maximum entropy classifier, which uses:
- Contextual word features
- POS tags
- Capitalization patterns
- Surrounding words
The pre-trained model was trained on the ACE (Automatic Content Extraction) corpus.
However, its accuracy is lower than that of modern neural NER models, such as BERT-based taggers or spaCy's neural pipelines.
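To make the feature types listed above more concrete, here is an illustrative sketch of a per-token feature function. It is not NLTK's actual feature set; token_features and the feature names are hypothetical, chosen only to show how word context, POS tags, capitalization, and neighboring words could be turned into classifier input.

```python
def token_features(tagged_tokens, i):
    """Build a feature dict for the token at position i."""
    word, pos = tagged_tokens[i]
    # Neighboring words, with sentinel values at the sentence boundaries.
    prev_word = tagged_tokens[i - 1][0].lower() if i > 0 else "<START>"
    next_word = (
        tagged_tokens[i + 1][0].lower() if i < len(tagged_tokens) - 1 else "<END>"
    )
    return {
        "word": word.lower(),
        "pos": pos,
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "prev_word": prev_word,
        "next_word": next_word,
    }


tagged = [("Sundar", "NNP"), ("Pichai", "NNP"), ("is", "VBZ"),
          ("the", "DT"), ("CEO", "NNP")]
features = token_features(tagged, 0)
print(features)
```

A maximum entropy classifier would learn a weight for each such feature and label combination, then choose the most probable label for each token.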
