Transformer Architecture
Objective: What Does a Transformer LLM Do?
A Transformer-based Large Language Model (LLM) takes a natural language input (like a question or instruction) and generates meaningful output (a word, phrase, sentence, or full story).
Example Input: "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."
Output: "Dear Sarah, I'm so sorry..."
This process is powered by three major components:
- Tokenizer
- Transformer Blocks (the core brain)
- LM Head (for word prediction)
Step 1: Tokenizer - Turning Text into Tokens
The tokenizer is the first step. It breaks raw text into chunks the model can understand, usually subword tokens.
Input: "Explain how it happened."
Tokenized as:
["Explain", "how", "it", "happen", "##ed", "."]
Why subwords? Because they let the model handle rare or unseen words (e.g., "gardener" → "garden" + "##er").
The tokenizer also maps each token to an ID using a token vocabulary, which works like a dictionary that assigns each token a number.
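As a rough sketch, here is how this step looks with the Hugging Face transformers library and a WordPiece tokenizer. The library and model name are illustrative choices, not part of the original example, and the exact splits depend on the tokenizer's vocabulary.

```python
# Sketch only: the tokenizer name and library are illustrative choices.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # a WordPiece tokenizer

text = "Explain how it happened."
tokens = tokenizer.tokenize(text)                     # split text into subword tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)   # look each token up in the vocabulary

print(tokens)     # the exact splits depend on the tokenizer's vocabulary
print(token_ids)  # one integer ID per token
```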
Step 2: Token Embeddings - IDs Become Vectors
Each token ID is mapped to a learned vector (called a token embedding).
- Vocabulary: 50,000+ tokens
- Embedding size: typically 768, 1024, etc.
So the sentence becomes a matrix of token embeddings:
["Explain"] β [0.56, -0.12, ..., 0.77]
["how"] β [...]
...
These embeddings are now fed into the Transformer Blocks.
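A minimal PyTorch sketch of this lookup. The vocabulary size, embedding dimension, and token IDs below are illustrative placeholders, not values taken from a real model.

```python
# Sketch only: sizes and token IDs are illustrative placeholders.
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768
embedding = nn.Embedding(vocab_size, d_model)   # one learned 768-dim vector per token ID

token_ids = torch.tensor([[101, 2129, 2009, 3047, 102]])  # hypothetical IDs for one sentence
vectors = embedding(token_ids)                  # lookup: IDs -> dense vectors

print(vectors.shape)  # torch.Size([1, 5, 768]) -- (batch, tokens, embedding size)
```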
Step 3: Stack of Transformer Blocks
This is where the model "thinks."
Each block contains:
- Self-Attention Layer - lets each word attend to the others
- Feedforward Neural Network - transforms and mixes information
- Layer Norm + Residual Connections - stabilize training and preserve the flow of information between layers
The blocks are stacked N times (e.g., 12 in BERT-base, 96 in GPT-3). Each block refines the model's understanding of the sentence.
Think of it as passing the message through N layers of wise interpreters.
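A simplified sketch of one such block in PyTorch is shown below. The pre-norm layout, dimensions, and activation are illustrative assumptions; production models add details such as causal masking and dropout.

```python
# Sketch only: a simplified pre-norm Transformer block with illustrative sizes.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention, then add the result back to the input (residual connection)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Feedforward network, again with a residual connection
        x = x + self.ff(self.norm2(x))
        return x

# Stack the block N times (N = 12 here) and pass the embeddings through every layer
blocks = nn.ModuleList([TransformerBlock() for _ in range(12)])
x = torch.randn(1, 6, 768)  # (batch, sequence length, embedding size)
for block in blocks:
    x = block(x)
```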
How Attention Works Inside Blocks
Every token looks at every other token and weighs it using dot products between queries and keys (Q · K). The resulting scores are normalized with a softmax and used to combine the value vectors (V).
For example:
- Token: "it"
- Might attend more to: "happen" and "Explain" than to "Write"
This lets the model understand syntax, subject-object relationships, and even tone.
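To make the arithmetic concrete, here is a toy NumPy sketch of scaled dot-product attention for a single head. The random matrices stand in for the learned query, key, and value projections, and the sizes are illustrative.

```python
# Sketch only: random Q, K, V stand in for learned projections of the token embeddings.
import numpy as np

seq_len, d_k = 6, 64                       # 6 tokens, 64-dim queries/keys/values
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)            # dot product of every token with every other token
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)  # softmax: each row of weights sums to 1
output = weights @ V                       # each token's output is a weighted mix of value vectors

print(weights[2].round(2))                 # how much token 3 ("it" in the example) attends to each token
```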
Step 4: Language Modeling Head (LM Head)
After processing, the model needs to predict the next token.
- It uses a linear layer to project the final hidden states back to the vocabulary size (e.g., 50,000)
- Then applies softmax to get probabilities for each possible token
Highest score → output token (e.g., "Dear")
This process repeats token-by-token during generation.
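A minimal PyTorch sketch of this step, with illustrative sizes and a random vector standing in for the Transformer's final hidden state:

```python
# Sketch only: sizes are illustrative and `hidden` is a random stand-in
# for the Transformer's final hidden state at the last position.
import torch
import torch.nn as nn

d_model, vocab_size = 768, 50_000
lm_head = nn.Linear(d_model, vocab_size)   # project hidden state to one score per vocabulary token

hidden = torch.randn(1, d_model)
logits = lm_head(hidden)                   # raw scores, shape (1, 50_000)
probs = torch.softmax(logits, dim=-1)      # probabilities over the whole vocabulary

next_token_id = probs.argmax(dim=-1)       # greedy choice: the highest-probability token
print(next_token_id)
```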
Full Pipeline: From Prompt to Response
- Text prompt → Tokenizer → Tokens (IDs)
- Token IDs → Embedding lookup
- Embeddings → Transformer blocks (with attention)
- Output → LM head → next word
- Feed generated word back in → repeat
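The loop below sketches the whole pipeline with a small pretrained model from the Hugging Face hub. gpt2 is an illustrative choice (the text above does not name a model), and greedy decoding keeps the example simple.

```python
# Sketch only: `gpt2` and greedy decoding are illustrative choices.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Dear Sarah,", return_tensors="pt").input_ids  # prompt -> token IDs

with torch.no_grad():
    for _ in range(10):                                    # generate 10 tokens, one at a time
        logits = model(input_ids).logits                   # (batch, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick the most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)      # feed it back in and repeat

print(tokenizer.decode(input_ids[0]))
```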
Summary Diagram Components
| Component | Function |
| --- | --- |
| Tokenizer | Breaks input text into tokens/IDs |
| Token Vocabulary | Maps tokens to IDs |
| Token Embeddings | Turns token IDs into dense vectors |
| Transformer Blocks | Contextualize and transform input |
| LM Head | Predicts next word/token |