Language Modeling with N-Grams and Neural Networks

🧠 Overview

This KI explains the core concepts of language modeling using n-grams, including bi-gram and tri-gram models, and their evolution into neural network-based language models. It also covers limitations of traditional n-grams and how neural models using embeddings and softmax improve predictive performance.


📚 What Is Language Modeling?

Language modeling is the task of predicting the next word in a sequence, given the previous word(s). It is foundational in many NLP tasks, including:

  • Text generation
  • Speech recognition
  • Machine translation

🔢 N-Grams: The Basics

➀ Definition:

An n-gram is a contiguous sequence of n words from a given sample of text.

  • Unigram (n=1): Uses no context (predicts from word frequency alone).
  • Bigram (n=2): Predicts based on the previous word.
  • Trigram (n=3): Predicts based on the previous two words.
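
To make the definition concrete, here is a minimal sketch in plain Python (the helper name `ngrams` is made up for this example) that extracts unigrams, bigrams, and trigrams from a tokenized sentence:

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I like vacations".split()

print(ngrams(tokens, 1))  # [('I',), ('like',), ('vacations',)]   -> unigrams
print(ngrams(tokens, 2))  # [('I', 'like'), ('like', 'vacations')] -> bigrams
print(ngrams(tokens, 3))  # [('I', 'like', 'vacations')]           -> trigrams
```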

➀ Bi-Gram Example:

"I like vacations" and "I hate surgery"

  • Text so far: "I like" → bigram context is "like" → Predict vacations
  • Text so far: "I hate" → bigram context is "hate" → Predict surgery

Bi-gram model estimates:

P(vacations | like) = 1
P(surgery | like) = 0

Context window size = 1
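
These estimates are just relative counts. A minimal sketch in plain Python (standard library only) that reproduces the numbers above from the two example sentences:

```python
from collections import Counter

# Toy corpus from the example above.
sentences = ["I like vacations", "I hate surgery"]

pair_counts = Counter()     # counts of (previous_word, next_word) pairs
context_counts = Counter()  # counts of the previous word on its own

for sentence in sentences:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        pair_counts[(prev, nxt)] += 1
        context_counts[prev] += 1

def bigram_prob(nxt, prev):
    """P(next | previous), estimated by relative counts."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, nxt)] / context_counts[prev]

print(bigram_prob("vacations", "like"))  # 1.0
print(bigram_prob("surgery", "like"))    # 0.0
```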


🔂 Tri-Gram Model

A tri-gram model conditions on the two previous words. Suppose the training text also contains "Surgeons like surgery":

  • Context: surgeons like → Predict surgery
  • Context: I like → Predict vacations

Now the probability depends on two words, increasing specificity:

P(surgery | surgeons, like) = 1
P(surgery | I, like) = 0

Context window size = 2


βš™οΈ General N-Gram Model​

N-gram models generalize this further:

  • Context window = n - 1
  • Predict next word w_t given: w_{t-n+1}, ..., w_{t-1}

Limitations: As n increases:

  • Memory and computation costs grow
  • N-gram counts become sparse, since most longer contexts never appear in the training data (see the sketch below)
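
To make the generalization concrete, below is a sketch of a count-based n-gram model with configurable n (plain Python; the class name `NGramModel` is made up for this example). Run with n = 3 on the tri-gram corpus above, it reproduces those estimates and also illustrates the sparsity problem: a context never seen in training yields no usable estimate.

```python
from collections import Counter

class NGramModel:
    """Count-based n-gram model: P(w_t | w_{t-n+1}, ..., w_{t-1})."""

    def __init__(self, n):
        self.n = n
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of (n-1)-word contexts

    def train(self, sentences):
        for sentence in sentences:
            tokens = sentence.split()
            for i in range(len(tokens) - self.n + 1):
                context = tuple(tokens[i:i + self.n - 1])
                word = tokens[i + self.n - 1]
                self.ngram_counts[context + (word,)] += 1
                self.context_counts[context] += 1

    def prob(self, word, context):
        context = tuple(context)
        if self.context_counts[context] == 0:
            return 0.0  # sparsity: an unseen context gives no estimate
        return self.ngram_counts[context + (word,)] / self.context_counts[context]

model = NGramModel(n=3)
model.train(["I like vacations", "Surgeons like surgery"])

print(model.prob("surgery", ("Surgeons", "like")))  # 1.0
print(model.prob("surgery", ("I", "like")))         # 0.0
print(model.prob("surgery", ("we", "like")))        # 0.0 -- context never seen (sparsity)
```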

🧱 Transition to Neural Language Models

To overcome sparsity and context limitations, we use neural networks:

🔸 One-Hot to Embedding

Words are no longer one-hot encoded. Instead:

  • Each word is mapped to a dense vector (embedding)
  • These embeddings capture semantic similarity
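
A minimal sketch of the lookup step, assuming PyTorch; the toy vocabulary and embedding dimension are made up for the example. Each word index selects a dense row of a learned embedding matrix instead of a sparse one-hot vector:

```python
import torch
import torch.nn as nn

vocab = ["I", "like", "hate", "vacations", "surgery", "surgeons"]  # toy vocabulary, size 6
word_to_idx = {word: i for i, word in enumerate(vocab)}

embedding_dim = 4                                   # small, just for illustration
embedding = nn.Embedding(len(vocab), embedding_dim)

idx = torch.tensor([word_to_idx["like"]])
dense_vec = embedding(idx)    # dense 4-dimensional vector for "like"
print(dense_vec.shape)        # torch.Size([1, 4])

# Contrast with one-hot: a length-6 vector that is all zeros except a single 1.
one_hot = torch.nn.functional.one_hot(idx, num_classes=len(vocab))
print(one_hot)                # tensor([[0, 1, 0, 0, 0, 0]])
```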

🔸 Input Vector

If context size = 2 and vocabulary size = 6:

  • You concatenate the context embeddings: [embedding(w_{t-2}) ; embedding(w_{t-1})] (concatenation, not addition)
  • Resulting input size = 2 × embedding_dim
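
A short sketch of this concatenation step, again assuming PyTorch and made-up dimensions; the two context embeddings are joined end to end, not summed:

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 6, 4
embedding = nn.Embedding(vocab_size, embedding_dim)

# Indices of the two context words w_{t-2} and w_{t-1}, e.g. "I" -> 0 and "like" -> 1.
context = torch.tensor([0, 1])
context_vecs = embedding(context)                   # shape: (2, embedding_dim)
x = torch.cat([context_vecs[0], context_vecs[1]])   # concatenation, not addition
print(x.shape)                                      # torch.Size([8]) == 2 * embedding_dim
```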

🔸 Architecture

A simple feedforward neural network with:

  • Input: Concatenated context embeddings
  • Hidden layer(s): With ReLU or similar activations
  • Output layer: One neuron per word in vocabulary
  • Activation: Softmax to turn scores into probabilities
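
Putting the pieces together, here is a minimal sketch of such a feedforward model in PyTorch; the class and parameter names (`NeuralNGram`, `hidden_dim`, and so on) are chosen for this example rather than taken from any particular library:

```python
import torch
import torch.nn as nn

class NeuralNGram(nn.Module):
    """Feedforward language model over a fixed context window."""

    def __init__(self, vocab_size, embedding_dim, context_size, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.hidden = nn.Linear(context_size * embedding_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)  # one score per vocabulary word

    def forward(self, context_ids):
        # context_ids: (batch, context_size) word indices
        embedded = self.embedding(context_ids)        # (batch, context_size, embedding_dim)
        x = embedded.view(context_ids.shape[0], -1)   # concatenated context embeddings
        h = torch.relu(self.hidden(x))                # hidden layer with ReLU
        scores = self.output(h)                       # raw scores, one per word
        return torch.softmax(scores, dim=-1)          # probabilities over the vocabulary

model = NeuralNGram(vocab_size=6, embedding_dim=4, context_size=2, hidden_dim=16)
probs = model(torch.tensor([[0, 1]]))   # one context of two word indices
print(probs.shape)                      # torch.Size([1, 6]); each row sums to 1
```

In training code the softmax is usually folded into the loss function (for example `nn.CrossEntropyLoss` applied to the raw scores), but it is written out here to mirror the description above.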

🔸 Prediction

Predict the word w_t by choosing:

argmax(P(w_t | context))
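
A small sketch of that selection step, assuming PyTorch; the probability values are made-up numbers standing in for the softmax output:

```python
import torch

vocab = ["I", "like", "hate", "vacations", "surgery", "surgeons"]

# Hypothetical softmax output for some context, one probability per vocabulary word.
probs = torch.tensor([0.05, 0.05, 0.05, 0.70, 0.10, 0.05])

predicted_index = torch.argmax(probs).item()
print(vocab[predicted_index])   # "vacations"
```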

🚫 Limitations of Basic Neural N-Gram Models

  • Fixed context window: Cannot model dependencies that fall outside the last n-1 words
  • No sequence awareness: Beyond the fixed window, the model has no notion of word position or order
  • No memory: Unlike RNNs or transformers, it carries no state from earlier in the sequence

✅ Recap

Concept         Description
N-Gram          Predicts the next word from the last n-1 words
Bi-Gram         Context = 1 previous word
Tri-Gram        Context = 2 previous words
Neural N-Gram   Learns the prediction using embeddings + a feedforward NN
Softmax         Converts raw scores to probabilities
Limitation      Cannot capture long-range or sequential dependencies