Transformer Architecture


🎯 Objective: What Does a Transformer LLM Do?

A Transformer-based Large Language Model (LLM) takes a natural language input (like a question or instruction) and generates meaningful output (a word, phrase, sentence, or full story).

Example Input: "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

Output: "Dear Sarah, I’m so sorry..."

This process is powered by three major components:

  • Tokenizer
  • Transformer Blocks (the core brain)
  • LM Head (for word prediction)

🔀 Step 1: Tokenizer – Turning Text into Tokens

The tokenizer is the first step. It breaks raw text into chunks the model can understand, usually subword tokens.

Input: "Explain how it happened."

Tokenized as:

["Explain", "how", "it", "happen", "##ed", "."]

Why subwords? Because they make it possible to handle rare or unseen words (e.g., "gardener" → "garden" + "##er").

The tokenizer also maps tokens to IDs using a token vocabulary, which works like a dictionary that assigns each token a number.
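To make this concrete, here is a minimal sketch using the Hugging Face transformers library with the bert-base-uncased tokenizer (both are assumptions for illustration; the article doesn't tie itself to a specific model, and the exact subword split depends on the vocabulary):

```python
# Tokenization sketch. Assumption: Hugging Face `transformers` with the
# `bert-base-uncased` WordPiece tokenizer; exact splits depend on the vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Explain how it happened."
tokens = tokenizer.tokenize(text)                    # subword tokens, e.g. ["explain", "how", ...]
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # vocabulary lookup: token -> integer ID

print(tokens)
print(token_ids)
```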


🔢 Step 2: Token Embeddings – IDs Become Vectors

Each token ID is mapped to a learned vector (called a token embedding).

  • Vocabulary: 50,000+ tokens
  • Embedding size: typically 768, 1024, etc.

So the sentence becomes a matrix of token embeddings:

["Explain"] β†’ [0.56, -0.12, ..., 0.77]
["how"] β†’ [...]
...

These embeddings are now fed into the Transformer Blocks.
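As a rough sketch (PyTorch is an assumed choice, and the sizes are illustrative), the embedding lookup is just an indexing operation into a learned matrix:

```python
# Embedding lookup sketch: token IDs index into a learned (vocab_size x embed_dim) matrix.
# Assumptions: PyTorch, a 50,000-token vocabulary, 768-dimensional embeddings.
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 768
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[1012, 2129, 2009, 3047]])  # hypothetical IDs for one sentence
token_vectors = embedding(token_ids)                  # shape: (1, 4, 768)
print(token_vectors.shape)
```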


🧠 Step 3: Stack of Transformer Blocks

This is where the model "thinks."

Each block contains:

  • Self-Attention Layer – lets each word attend to others
  • Feedforward Neural Network – transforms and mixes information
  • Layer Norm + Residual Connections – stabilize and preserve input/output flow

The blocks are stacked N times (e.g., 12 in BERT base, 96 in GPT-3). Each block refines the understanding of the sentence.

Think of it as passing the message through N layers of wise interpreters.
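A single block can be sketched in a few lines of PyTorch (a simplified illustration, not the exact layout of BERT or GPT; real models differ in details such as pre- vs. post-layer norm):

```python
# Simplified Transformer block: self-attention + feedforward network,
# each wrapped in a residual connection followed by layer norm.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim: int = 768, num_heads: int = 12, ff_dim: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)   # self-attention: every token attends to every token
        x = self.norm1(x + attn_out)       # residual connection + layer norm
        x = self.norm2(x + self.ff(x))     # feedforward + residual + layer norm
        return x

# Stack N blocks (e.g., 12) and pass the embeddings through them in order.
blocks = nn.ModuleList([TransformerBlock() for _ in range(12)])
hidden = torch.randn(1, 6, 768)            # (batch, sequence length, embedding size)
for block in blocks:
    hidden = block(hidden)
```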



🧠 How Attention Works Inside Blocks

Every token compares itself with every other token using dot products between queries and keys (Q · K). The resulting scores are turned into weights, which are used to combine the values (V).

For example:

  • Token: "it"
  • Might attend more to "happen" and "Explain" than to "Write"

This lets the model understand syntax, subject-object relationships, and even tone.
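Written out directly, the computation looks like this (a sketch with made-up sizes; real models also add masking and split the work across multiple attention heads):

```python
# Scaled dot-product attention: scores from Q.K dot products, softmax into
# weights, weighted sum of V. Sizes here are illustrative.
import torch
import torch.nn.functional as F

seq_len, d = 6, 64                   # 6 tokens, 64-dimensional queries/keys/values
Q = torch.randn(seq_len, d)          # queries
K = torch.randn(seq_len, d)          # keys
V = torch.randn(seq_len, d)          # values

scores = Q @ K.T / d ** 0.5          # (6, 6): how strongly each token attends to each other token
weights = F.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ V                 # each token's output is a weighted mix of the values
```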


🧠 Step 4: Language Modeling Head (LM Head)

After processing, the model needs to predict the next token.

  • It uses a linear layer to project the final hidden states back to the vocabulary size (e.g., 50,000)
  • Then applies softmax to get probabilities for each possible token

Highest probability → output token (e.g., "Dear")

This process repeats token-by-token during generation.
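In code, the LM head is essentially one linear layer plus a softmax (sizes are illustrative assumptions):

```python
# LM head sketch: project the final hidden state to vocabulary-sized logits,
# softmax into probabilities, then pick (or sample) the next token.
import torch
import torch.nn as nn

embed_dim, vocab_size = 768, 50_000
lm_head = nn.Linear(embed_dim, vocab_size)

last_hidden = torch.randn(1, embed_dim)   # hidden state at the final position
logits = lm_head(last_hidden)             # shape: (1, 50000)
probs = torch.softmax(logits, dim=-1)     # probability for every token in the vocabulary
next_token_id = probs.argmax(dim=-1)      # greedy decoding: take the most likely token
```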


🔄 Full Pipeline: From Prompt to Response

  1. Text prompt → Tokenizer → Tokens (IDs)
  2. Token IDs → Embedding lookup
  3. Embeddings → Transformer blocks (with attention)
  4. Output → LM head → next word
  5. Feed generated word back in → repeat

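The whole loop fits in a few lines. This sketch uses Hugging Face GPT-2 as a stand-in model (an assumption; the article doesn't name one) with greedy decoding:

```python
# End-to-end greedy generation loop, with GPT-2 as an example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Dear Sarah,", return_tensors="pt").input_ids  # steps 1-2: tokenize + IDs

for _ in range(20):                                        # generate 20 tokens
    logits = model(input_ids).logits                       # steps 3-4: blocks + LM head
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_id], dim=-1)    # step 5: feed the new token back in

print(tokenizer.decode(input_ids[0]))
```
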
🧩 Summary Diagram Components

  • Tokenizer – Breaks input text into tokens/IDs
  • Token Vocabulary – Maps tokens to IDs
  • Token Embeddings – Turns token IDs into dense vectors
  • Transformer Blocks – Contextualize and transform the input
  • LM Head – Predicts the next word/token