# Transformers and Attention: Explained with Visuals

## Why Were Transformers Created?
Before transformers, sequence models like RNNs and LSTMs were used to process language. But they had limitations:
- Slow: RNNs process one word at a time (no parallelism)
- Memory bottlenecks: they struggle with long-range dependencies
- Hard to scale
Then came the Transformer architecture (Vaswani et al., 2017), with its famous tagline:
"Attention is All You Need"
Instead of processing tokens sequentially, transformers use self-attention to process the entire input at once, capturing dependencies across the sequence.
## What Is Attention? (Recap)
Attention is the idea that:
Some parts of the input are more relevant to each output than others.
Instead of compressing everything into a single fixed vector, attention lets each word "look around" at other words and weigh them.
"The cat sat on the mat": to understand "sat," the model attends to "cat."
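To make this concrete, here is a minimal sketch of scaled dot-product self-attention in plain NumPy. The three-token embeddings are random toy values (not anything from the slides); the point is that each row of the weight matrix says how strongly that token attends to every other token.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Scores measure how relevant each key is to each query
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # one weight distribution per query token
    return weights @ V, weights          # weighted mix of the value vectors

# Toy example: 3 tokens, 4-dimensional embeddings (random, for illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))              # pretend these are embeddings for "the", "cat", "sat"
output, weights = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(weights.round(2))                  # each row sums to 1
```

In a real transformer, Q, K, and V come from separate learned projections of the embeddings, and several attention heads run in parallel.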
## Anatomy of a Transformer (Slides 1–2)

### Encoder Side (Understanding Input)
Input: "I love llamas"
Each word is embedded and passed into multiple encoder layers, each with:
- Self-Attention: each word attends to all other words in the sentence
- Feedforward Network: learns complex representations

This creates contextualized embeddings: "love" knows it relates to "I" and "llamas."
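To see contextualized embeddings in practice, the hedged sketch below runs the sentence through an encoder-only model with the Hugging Face transformers library and inspects the per-token hidden states; bert-base-uncased is just a convenient small checkpoint, not one the slides prescribe.

```python
# pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I love llamas", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)token: shape (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
```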
### Decoder Side (Generating Output)
The decoder takes:
- Previously generated words, e.g. "Ik", "hou" (Dutch for "I love")
- Plus encoded context (from the encoder)

and generates the next word: "van" (sketched in code below).
The decoder includes:
- Masked Self-Attention (explained next)
- Encoder-Decoder Attention (uses encoder output)
- Feedforward Network
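The sketch below shows this loop from the outside: the encoder reads the English sentence once, then model.generate lets the decoder emit Dutch tokens one at a time, each step attending to its own previous outputs and to the encoder states. The Helsinki-NLP/opus-mt-en-nl checkpoint is an assumption; any English-to-Dutch seq2seq model would do.

```python
# pip install transformers sentencepiece torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-en-nl"   # assumed checkpoint; swap in any en->nl seq2seq model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The encoder reads the full English sentence; the decoder then generates Dutch
# tokens autoregressively, attending to its previous outputs and the encoder states.
inputs = tokenizer("I love llamas", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # e.g. "Ik hou van lama's"
```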
## What Is Masked Self-Attention? (Slide 2)

You don't want a model to "cheat" by looking at future words during generation.
Masked self-attention prevents the model from peeking ahead. When generating the third word, it only sees the first two:
| Generating Word | Can Attend To |
|---|---|
| 1st | [Start token] |
| 2nd | [1st] |
| 3rd | [1st, 2nd] |
It's how autoregressive models like GPT maintain left-to-right generation.
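A minimal NumPy sketch of the trick: add negative infinity to the attention scores above the diagonal before the softmax, so every position gets zero weight on positions that come after it. The scores here are random toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))       # raw attention scores (query x key)

# Causal mask: -inf above the diagonal, so "future" keys get zero weight after softmax
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
weights = softmax(scores + mask, axis=-1)

print(weights.round(2))   # upper triangle is 0: position i attends only to positions <= i
```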
## Encoder-Only vs. Decoder-Only vs. Encoder-Decoder (Slides 3–6)

### Encoder-Only: BERT (Representation Model)
- Task: Understand text (classification, QA, embeddings)
- Uses bidirectional self-attention (sees left + right context)
- Trained with masked language modeling (Slide 4):
  - Randomly mask a word
  - Predict the missing word (e.g., "I [MASK] llamas" → "am"); see the fill-mask sketch below
Fine-tuned for downstream tasks (Slide 5):
- Sentiment classification
- Named entity recognition
- Sentence similarity
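Coming back to the masked-language-modeling objective, you can poke at it directly with the Hugging Face fill-mask pipeline; bert-base-uncased is again just one convenient checkpoint, and its top predictions are not guaranteed to match the slide's example.

```python
# pip install transformers torch
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("I [MASK] llamas."):
    # Each prediction carries a candidate token and its probability-like score
    print(f"{prediction['token_str']:>10}  {prediction['score']:.3f}")
```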
### Decoder-Only: GPT (Generative Model)
- Task: Generate text (completion, stories, chat)
- Uses masked self-attention only (left-to-right)
- No encoder at all
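A quick decoder-only sketch using the text-generation pipeline; gpt2 is an arbitrary small checkpoint, and the sampled continuation will differ on every run.

```python
# pip install transformers torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("I love llamas because", max_new_tokens=20, do_sample=True, top_k=50)
print(result[0]["generated_text"])   # left-to-right continuation of the prompt
```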
### Encoder-Decoder: T5, BART, Translation Models
- Input goes through the encoder
- The decoder generates outputs using attention over the encoder's representations
- Used for: translation, summarization, question answering
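As one hedged end-to-end example of the encoder-decoder pattern, the sketch below summarizes a short paragraph with the t5-small checkpoint (an arbitrary small choice): the encoder reads the whole input, and the decoder writes the summary token by token using encoder-decoder attention. The input text is made up for illustration.

```python
# pip install transformers sentencepiece torch
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")
text = (
    "Transformers replaced recurrent networks for most language tasks because "
    "self-attention lets every token look at every other token in parallel, "
    "which scales much better with data and compute."
)
summary = summarizer(text, max_length=30, min_length=5, do_sample=False)
print(summary[0]["summary_text"])
```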
## Summary Table
| Model Type | Uses Encoder | Uses Decoder | Attention Type | Example Models |
|---|---|---|---|---|
| Encoder-Only | Yes | No | Bidirectional self-attention | BERT, RoBERTa |
| Decoder-Only | No | Yes | Masked self-attention | GPT, GPT-2/3/4 |
| Encoder-Decoder | Yes | Yes | Self-attention + encoder-decoder attention | T5, BART, mT5 |
## Why Transformers Win

- Parallelizable (not step-by-step like RNNs)
- Scales well with data and compute
- Captures global context with attention
- Foundation of LLMs like GPT, BERT, Claude, Gemini, etc.