Created
Nov 19, 2025

Natural Language Processing (NLP)

NLP and Text Processing

NLP stands at the intersection of linguistics, computer science, and AI. It enables machines to understand, interpret, and respond to human language.

NLP applications range from chatbots and language translation to sentiment analysis and content generation. It uses computational techniques for processing and analyzing human language.


Combining Linguistics

NLP combines linguistic theory with computational methods to develop algorithms that analyze, process, and generate language automatically.

This field bridges the gap between language and machines, enabling meaningful interaction.


NLP Uses and Examples

NLP is primarily used on text-based datasets.
Examples:

  • Chatbots

  • Email filters

  • Sentiment analysis

  • Machine translation


Text Processing Concepts

1. Tokenization:

Tokenization is the process of splitting a sentence or string into smaller units called tokens.

  • A token is the basic unit the text is split into; depending on the tokenizer, it can be a word, a subword, or a single character.

  • Example:

    • Sentence: “Ram is a good boy”

    • Word tokens: Ram, is, a, good, boy

    • Character tokens: R, a, m, i, s, …
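
The word-level and character-level splits above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `word_tokens` and `char_tokens` are made up for this example, and the regular expression is a simplifying assumption that ignores punctuation handling done by real tokenizers.

```python
import re


def word_tokens(text: str) -> list[str]:
    """Split text into word tokens (runs of letters, digits, or apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def char_tokens(text: str) -> list[str]:
    """Split text into single-character tokens, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


sentence = "Ram is a good boy"
print(word_tokens(sentence))      # ['Ram', 'is', 'a', 'good', 'boy']
print(char_tokens(sentence)[:5])  # ['R', 'a', 'm', 'i', 's']
```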

2. N-gram Tokenization:

  • Splits text into overlapping sequences of N consecutive tokens.

  • Example (bi-gram) for “Ram is a good boy”: "Ram is", "is a", "a good", "good boy"
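
A sliding window over the token list is one simple way to produce N-grams. The helper below is a sketch, not a library API; the name `ngrams` and the choice to join tokens with a space are assumptions made for this example.

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return every run of n consecutive tokens, joined by a space."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The window starts at each index that still leaves n tokens to read.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "Ram is a good boy".split()
print(ngrams(tokens, 2))  # ['Ram is', 'is a', 'a good', 'good boy']
```

If the text has fewer than N tokens, the function returns an empty list.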

3. Stop Words:

  • Stop words are very common words (articles, prepositions, auxiliary verbs) that carry little meaning on their own.

  • Common examples: the, is, in, over

  • Removing stop words helps in:

    • Text classification

    • Information retrieval

    • Search engine keyword extraction

Example:

  • Sentence: “The quick brown fox jumps over the lazy dog”

  • Stop words: the, over

  • Content words: quick, brown, fox, jumps, lazy, dog
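
The example above can be reproduced with a simple set lookup. This sketch assumes a tiny hand-written stop list taken from the examples in these notes; real stop lists are much longer, and the function name `split_stop_words` is invented for illustration.

```python
# Assumption: a tiny stop list built from the examples in these notes.
STOP_WORDS = {"the", "is", "in", "over"}


def split_stop_words(sentence: str) -> tuple[list[str], list[str]]:
    """Return (content_words, stop_words) after lowercasing and splitting on spaces."""
    tokens = sentence.lower().split()
    content = [t for t in tokens if t not in STOP_WORDS]
    stops = [t for t in tokens if t in STOP_WORDS]
    return content, stops


content, stops = split_stop_words("The quick brown fox jumps over the lazy dog")
print(content)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
print(stops)    # ['the', 'over', 'the']
```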

4. Lemmatization:

Lemmatization uses morphological analysis to reduce the inflected forms of a word to a single dictionary form (the lemma), so they can be analyzed as one item. For example, running, ran, and runs all map to the lemma run.

Applications:
  • Used in compact indexing

  • Used in comprehensive retrieval systems, such as search engines

Advantages:
  • Helps focus on meaningful words rather than different word forms

Disadvantages:
  • Slower than stemming on large datasets, because it relies on dictionary lookups and often on part-of-speech information
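
To make the idea of grouping inflected forms concrete, here is a toy dictionary-based lemmatizer. The lookup table `LEMMA_TABLE` is a hypothetical mini-lexicon written for this example; production lemmatizers rely on a full lexicon and part-of-speech tags rather than a hand-written table.

```python
# Hypothetical mini-lexicon: inflected form -> lemma (illustration only).
LEMMA_TABLE = {
    "running": "run", "ran": "run", "runs": "run",
    "better": "good", "best": "good",
    "mice": "mouse", "boys": "boy",
}


def lemmatize(word: str) -> str:
    """Look up the lemma; unknown words are returned lowercased and unchanged."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


words = ["Running", "ran", "runs", "mice", "fox"]
print([lemmatize(w) for w in words])  # ['run', 'run', 'run', 'mouse', 'fox']
```

Because the result is always a dictionary word, irregular forms such as "ran" or "mice" are handled correctly, which rule-based suffix stripping cannot do.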

5. Stemming:

  • Definition: Stemming reduces words to a root form (stem) by stripping prefixes and suffixes with heuristic rules.

  • Example:
    finally → final, finalize → fina

  • Advantages: Fast and simple

  • Disadvantages: Can lose meaning; may produce non-words (such as "fina" above)

  • Purpose: Maps related word forms to a common base form so they can be processed together.
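
The trade-off described above (fast, but may produce non-words) shows up even in a very naive suffix stripper. The suffix list and the `min_stem` threshold below are assumptions chosen for this sketch; it is not the Porter algorithm or any library's stemmer, and the exact stems depend entirely on the rule set used.

```python
# Assumption: a small, hand-picked suffix list, ordered longest first.
SUFFIXES = ("ization", "izing", "ized", "ize", "ing", "ly", "ed", "es", "s")


def naive_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["finally", "finalize", "studies", "running"]
print([naive_stem(w) for w in words])  # ['final', 'final', 'studi', 'runn']
```

"studi" and "runn" are not real words, which is exactly the weakness that lemmatization avoids at the cost of speed.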

A Detailed Study on Stemming vs Lemmatization In Python


Bag of Words (BoW)

  • NLP data (text) needs to be converted into numerical form for machine learning.

  • Bag of Words Model:

    • Represents text (sentence, paragraph, document) as a collection ("bag") of its words and their counts

    • Ignores word order and grammar

    • Focuses on word frequency

  • Applications: Text classification, sentiment analysis, clustering

Example:

Text:
"He's a good boy. She is a good girl. Boys and girls are good."

Vocabulary Count:

Word    Count
good    3
boy     2
girl    2

BoW Representation (f₁=good, f₂=boy, f₃=girl), one row per sentence:

              good    boy    girl
Sentence 1    1       1       0
Sentence 2    1       0       1
Sentence 3    1       1       1
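
The counts and vectors in this example can be reproduced with the standard library. The sketch below assumes the preprocessing the example implies: lowercasing, dropping stop words, and mapping the plurals "boys" and "girls" to their singular forms. The stop list `STOP_WORDS` and the `SINGULAR` table are hand-written for this one text and are not general-purpose.

```python
import re
from collections import Counter

# Assumptions: stop list and plural mapping tailored to the example text.
STOP_WORDS = {"he's", "she", "is", "a", "and", "are"}
SINGULAR = {"boys": "boy", "girls": "girl"}


def preprocess(sentence: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, and singularize known plurals."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [SINGULAR.get(t, t) for t in tokens if t not in STOP_WORDS]


text = "He's a good boy. She is a good girl. Boys and girls are good."
sentences = [s.strip() for s in text.split(".") if s.strip()]
docs = [preprocess(s) for s in sentences]

# Vocabulary ordered by total frequency (ties keep first-seen order).
counts = Counter(word for doc in docs for word in doc)
vocabulary = [word for word, _ in counts.most_common()]

# One count vector per sentence, with columns following the vocabulary order.
vectors = [[Counter(doc)[word] for word in vocabulary] for doc in docs]

print(dict(counts))  # {'good': 3, 'boy': 2, 'girl': 2}
print(vocabulary)    # ['good', 'boy', 'girl']
print(vectors)       # [[1, 1, 0], [1, 0, 1], [1, 1, 1]]
```

Each vector records only how often a vocabulary word occurs, so the word order within a sentence is lost.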


N-Gram

  • Definition: An N-gram is a contiguous sequence of N items (words or characters) from text.
    Example sentence: "This is a sentence"

  • Types:

    • Unigram: Single words
      "This", "is", "a", "sentence"

    • Bigram: Two consecutive words
      "This is", "is a", "a sentence"

    • Trigram: Three consecutive words
      "This is a", "is a sentence"

  • Uses: Capturing local context, building language models, text prediction, information retrieval