Named Entity Recognition

1. Introduction to Named Entity Recognition (NER)

Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP) that identifies specific types of entities in text and classifies them into predefined categories such as:

PERSON – Names of people (e.g., Amitabh Bachchan, Albert Einstein)
ORGANIZATION – Institutions or companies (e.g., Google, Indian Railways)
LOCATION – Cities, countries, or landmarks (e.g., Mumbai, Himalayas, USA)
DATE – Specific dates or periods (e.g., 1st November 2025)
MONEY – Monetary amounts (e.g., ₹50,000, $1000)
TIME – Time expressions (e.g., 5 PM, two hours ago)

For example:

Text: “Ratan Tata is the former chairman of Tata Sons, headquartered in Mumbai.”
Entities:

Ratan Tata → PERSON

Tata Sons → ORGANIZATION

Mumbai → LOCATION
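
To make "structured information" concrete, the example above can be stored as labeled spans with character offsets. This is a minimal sketch: the EntitySpan class and the locate_entities helper are hypothetical names introduced for illustration, and the labels are supplied by hand rather than predicted by a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    text: str
    label: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def locate_entities(text, labeled_names):
    """Find the first occurrence of each (name, label) pair in the text."""
    spans = []
    for name, label in labeled_names:
        start = text.find(name)
        if start != -1:
            spans.append(EntitySpan(name, label, start, start + len(name)))
    return sorted(spans, key=lambda span: span.start)


text = "Ratan Tata is the former chairman of Tata Sons, headquartered in Mumbai."
spans = locate_entities(
    text,
    [("Ratan Tata", "PERSON"), ("Tata Sons", "ORGANIZATION"), ("Mumbai", "LOCATION")],
)
for span in spans:
    print(span)
```

Storing offsets alongside the label lets downstream code highlight or link the entity in the original text without searching for it again.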

2. Why NER Is Important

Much real-world text, such as news, reports, and emails, is unstructured. NER turns this text into structured information by identifying the key entities it mentions.

Key use cases:

Information Extraction – Identify important entities like names or locations in documents.
Content Categorization – Tag news articles by organization or person.
Search Optimization – Improve search relevance using entity recognition.
Chatbots – Extract details such as locations or dates from user messages.
Finance and Business – Identify company names, stock tickers, and transaction values.
Healthcare – Detect diseases, drugs, or medical conditions from clinical text.
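
As an illustration of the content categorization and search use cases, the sketch below builds a small index from entities to the documents that mention them. The document ids and entity lists are made-up sample data standing in for real NER output, and build_entity_index is a hypothetical helper.

```python
from collections import defaultdict


def build_entity_index(doc_entities):
    """Map each (entity, label) pair to the sorted ids of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for name, label in entities:
            index[(name, label)].add(doc_id)
    return {key: sorted(ids) for key, ids in index.items()}


# Hypothetical NER output for three news articles
doc_entities = {
    "news-1": [("Google", "ORGANIZATION"), ("Mumbai", "LOCATION")],
    "news-2": [("Indian Railways", "ORGANIZATION"), ("Mumbai", "LOCATION")],
    "news-3": [("Google", "ORGANIZATION")],
}

index = build_entity_index(doc_entities)
print(index[("Mumbai", "LOCATION")])  # ['news-1', 'news-2']
```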

3. How NER Works in NLTK

The NLTK (Natural Language Toolkit) library provides basic NER through ne_chunk, which applies a pre-trained maximum entropy classifier to part-of-speech-tagged tokens.
For more advanced setups, NLTK also includes a wrapper around the external Stanford Named Entity Recognizer.

The general steps for NER in NLTK are:

Tokenization – Split text into words.
POS Tagging – Assign part-of-speech tags (noun, verb, etc.).
Chunking – Group tagged tokens into candidate entity spans.
Named Entity Recognition – Classify each span as PERSON, ORGANIZATION, etc. (in NLTK, ne_chunk performs chunking and classification in a single call).
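
The first three steps can be sketched without any trained model. The code below is a deliberately crude stand-in, not NLTK's implementation: toy_tokenize, toy_tag, and toy_chunk are hypothetical helpers, the "tagger" only checks capitalization, and the final classification step (deciding PERSON versus ORGANIZATION) is omitted because it needs a trained classifier.

```python
import re


def toy_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_tag(tokens):
    """Assumption: treat every capitalized token as a proper noun (NNP)."""
    return [(tok, "NNP" if tok[:1].isupper() else "O") for tok in tokens]


def toy_chunk(tagged):
    """Group runs of consecutive NNP tokens into candidate entity spans."""
    chunks, current = [], []
    for tok, tag in tagged:
        if tag == "NNP":
            current.append(tok)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Ratan Tata is the former chairman of Tata Sons, headquartered in Mumbai."
candidates = toy_chunk(toy_tag(toy_tokenize(text)))
print(candidates)  # ['Ratan Tata', 'Tata Sons', 'Mumbai']
```

A capitalization rule like this breaks on sentence-initial words such as "The", which is one reason a real pipeline relies on a trained POS tagger and chunker.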

4. Setting Up NLTK

Step 1: Install and Import NLTK

pip install nltk

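Then, in Python (the resource names below apply to older NLTK releases; NLTK 3.9 and later may instead ask for punkt_tab, averaged_perceptron_tagger_eng, and maxent_ne_chunker_tab, so if a LookupError names a different resource, download the one it names):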

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

These downloads provide:

punkt – Tokenizer
averaged_perceptron_tagger – POS tagger
maxent_ne_chunker – NER chunker
words – Word corpus for dictionary reference

5. Performing Named Entity Recognition in NLTK

Let’s walk through the steps with a simple example.

Step 1: Input Text

text = "Sundar Pichai is the CEO of Google and lives in California."

Step 2: Tokenization

from nltk import word_tokenize

tokens = word_tokenize(text)
print(tokens)

Output:

['Sundar', 'Pichai', 'is', 'the', 'CEO', 'of', 'Google', 'and', 'lives', 'in', 'California', '.']

Step 3: Part-of-Speech (POS) Tagging

from nltk import pos_tag

pos_tags = pos_tag(tokens)
print(pos_tags)

Output:

[('Sundar', 'NNP'), ('Pichai', 'NNP'), ('is', 'VBZ'), ('the', 'DT'),
 ('CEO', 'NNP'), ('of', 'IN'), ('Google', 'NNP'), ('and', 'CC'),
 ('lives', 'VBZ'), ('in', 'IN'), ('California', 'NNP'), ('.', '.')]

Explanation:

NNP → Proper noun (singular)
VBZ → Verb (3rd person singular present)
DT → Determiner

Step 4: Named Entity Chunking

from nltk import ne_chunk

chunks = ne_chunk(pos_tags)
print(chunks)

Output (tree structure; the exact chunks and labels can vary with the NLTK and model version):

(S
  (PERSON Sundar/NNP Pichai/NNP)
  is/VBZ
  the/DT
  CEO/NNP
  of/IN
  (ORGANIZATION Google/NNP)
  and/CC
  lives/VBZ
  in/IN
  (GPE California/NNP)
  ./.)

Explanation:

(PERSON Sundar Pichai) → Person name
(ORGANIZATION Google) → Organization
(GPE California) → Geopolitical entity, such as a city, state, or country
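
NLTK labels California as GPE, while the category list in the introduction uses LOCATION. If a single scheme is needed downstream, the labels can be normalized with a lookup table. The mapping below is an assumption made for this sketch (folding GPE and FACILITY into LOCATION), not something NLTK does itself, and normalize_labels is a hypothetical helper.

```python
# Assumed mapping from NLTK chunker labels to the tutorial's categories
LABEL_MAP = {
    "GPE": "LOCATION",
    "FACILITY": "LOCATION",
}


def normalize_labels(entities):
    """Replace labels found in LABEL_MAP; leave all other labels unchanged."""
    return [(name, LABEL_MAP.get(label, label)) for name, label in entities]


# (name, label) pairs read off the tree above
entities_from_tree = [
    ("Sundar Pichai", "PERSON"),
    ("Google", "ORGANIZATION"),
    ("California", "GPE"),
]
print(normalize_labels(entities_from_tree))
```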

6. Visualizing the Named Entity Tree

NLTK can visualize the hierarchical entity structure using its tree visualizer.

chunks.draw()

A GUI window opens showing the tree with labeled entities (PERSON, ORGANIZATION, GPE). This requires Tkinter and a graphical display, so it will not work on a headless server.

7. Extracting Named Entities Programmatically

The following function walks the top level of the tree and collects every entity subtree as a (name, type) pair:

from nltk.tree import Tree

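def extract_named_entities(tree):
    """Return (entity_name, entity_type) pairs for each entity subtree."""
    entities = []
    for subtree in tree:
        # Entity chunks are Tree nodes; ordinary tokens are (word, tag) tuples.
        if isinstance(subtree, Tree):
            entity_name = " ".join(token for token, pos in subtree.leaves())
            entity_type = subtree.label()
            entities.append((entity_name, entity_type))
    return entities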

entities = extract_named_entities(chunks)
print(entities)

Output:

[('Sundar Pichai', 'PERSON'), ('Google', 'ORGANIZATION'), ('California', 'GPE')]
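
Once the entities are in a flat list of pairs, ordinary Python is enough to reshape them. As a small sketch, the hypothetical group_by_type helper below collects names under their entity type; the input list is copied from the output above so the snippet runs on its own.

```python
from collections import defaultdict


def group_by_type(entities):
    """Group entity names by type, keeping first-seen order and dropping repeats."""
    grouped = defaultdict(list)
    for name, entity_type in entities:
        if name not in grouped[entity_type]:
            grouped[entity_type].append(name)
    return dict(grouped)


# Copied from the output above
entities = [("Sundar Pichai", "PERSON"), ("Google", "ORGANIZATION"), ("California", "GPE")]
print(group_by_type(entities))
```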

8. NER Example with Multiple Sentences

paragraph = ("Elon Musk founded SpaceX in 2002. "
             "He also leads Tesla Motors, headquartered in California.")
               
sentences = nltk.sent_tokenize(paragraph)

for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    pos_tags = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(pos_tags)
    print(extract_named_entities(tree))

Output (actual results can vary with the NLTK and model version):

[('Elon Musk', 'PERSON'), ('SpaceX', 'ORGANIZATION')]
[('Tesla Motors', 'ORGANIZATION'), ('California', 'GPE')]
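
The output above has no entry for the year 2002. One common workaround is to supplement the chunker with simple rules for patterns such as years. The sketch below is an assumption layered on top of the tutorial's pipeline, not an NLTK feature: YEAR_PATTERN and find_years are hypothetical names, and the pattern only covers four-digit years from 1800 to 2099.

```python
import re

# Assumption: a "year" is a standalone four-digit number between 1800 and 2099
YEAR_PATTERN = re.compile(r"\b(?:1[89]|20)\d{2}\b")


def find_years(sentence):
    """Return (text, 'DATE') pairs for each standalone year in the sentence."""
    return [(match.group(), "DATE") for match in YEAR_PATTERN.finditer(sentence)]


print(find_years("Elon Musk founded SpaceX in 2002."))  # [('2002', 'DATE')]
print(find_years("He also leads Tesla Motors, headquartered in California."))  # []
```

The resulting pairs have the same shape as those returned by extract_named_entities, so the two lists can simply be concatenated per sentence.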

9. How NLTK’s NER Works Internally

NLTK’s built-in NER is based on a maximum entropy classifier, which uses:

Contextual word features
POS tags
Capitalization patterns
Surrounding words
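
To give a feel for what such features look like, here is a minimal sketch of a per-token feature extractor. The feature names and the token_features function are illustrative assumptions; NLTK's actual feature set is richer and is not reproduced here.

```python
def token_features(tagged_tokens, i):
    """Build a feature dict for the token at position i (illustrative only)."""
    word, pos = tagged_tokens[i]
    prev_word = tagged_tokens[i - 1][0].lower() if i > 0 else "<START>"
    next_word = tagged_tokens[i + 1][0].lower() if i < len(tagged_tokens) - 1 else "<END>"
    return {
        "word.lower": word.lower(),          # contextual word feature
        "pos": pos,                          # POS tag
        "is_capitalized": word[:1].isupper(),  # capitalization pattern
        "is_all_caps": word.isupper(),
        "prev_word": prev_word,              # surrounding words
        "next_word": next_word,
    }


tagged = [
    ("Sundar", "NNP"), ("Pichai", "NNP"), ("is", "VBZ"), ("the", "DT"),
    ("CEO", "NNP"), ("of", "IN"), ("Google", "NNP"), ("and", "CC"),
    ("lives", "VBZ"), ("in", "IN"), ("California", "NNP"), (".", "."),
]
print(token_features(tagged, 6))  # features for 'Google'
```

A maximum entropy classifier learns a weight for each feature-label combination, so a feature dict like this is the input it scores when choosing a label for a token.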

This model was trained on the ACE (Automatic Content Extraction) corpus.
However, its accuracy is generally lower than that of modern neural NER systems, such as BERT-based taggers or spaCy’s trained pipelines.
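
Accuracy for NER is usually reported as entity-level precision, recall, and F1. The sketch below scores a predicted entity list against a hand-labeled reference using exact matches on (name, type) pairs. The entity_prf function and the sample predictions are hypothetical, and comparing sets ignores entity positions and repeated mentions.

```python
def entity_prf(predicted, gold):
    """Exact-match precision, recall, and F1 over (name, type) pairs."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hand-labeled reference entities
gold = [("Sundar Pichai", "PERSON"), ("Google", "ORGANIZATION"), ("California", "GPE")]
# Hypothetical system output that splits the person name incorrectly
predicted = [("Sundar", "PERSON"), ("Google", "ORGANIZATION"), ("California", "GPE")]

precision, recall, f1 = entity_prf(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Running the same scoring over output from NLTK and from another system on the same labeled sentences gives a like-for-like comparison.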