
🤖 Transformers and Attention – Explained with Visuals


🧠 Why Were Transformers Created?​

Before transformers, sequence models like RNNs and LSTMs were used to process language. But they had limitations:

  • 🐌 Slow: RNNs process one word at a time (no parallelism)
  • 🧠 Memory bottlenecks: They struggle with long-range dependencies
  • πŸ—οΈ Hard to scale

Then came the Transformer architecture (Vaswani et al., 2017), with its famous tagline:

"Attention Is All You Need"

Instead of processing tokens sequentially, transformers use self-attention to process the entire input at once, capturing dependencies across the sequence.


🔍 What Is Attention? (Recap)

Attention is the idea that:

Some parts of the input are more relevant to each output than others.

Instead of compressing everything into a single fixed vector, attention lets each word "look around" at other words and weigh them.

"The cat sat on the mat" → to understand "sat," the model attends to "cat."
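To make "look around and weigh" concrete, here is a minimal NumPy sketch of scaled dot-product attention, the operation at the heart of transformers. The token embeddings are random stand-ins, and the queries, keys, and values are all taken from the same matrix for simplicity; in a real model they come from learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each row of Q attends to every row of K; softmax weights then mix the rows of V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (seq_len, seq_len) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V, weights

# Toy example: 6 tokens ("The cat sat on the mat"), 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
output, weights = scaled_dot_product_attention(x, x, x)
print(weights[2].round(2))  # how strongly "sat" (token 3) attends to each of the 6 tokens
```

Each row of `weights` sums to 1, so the output for a word is literally a weighted average of the words it attended to.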


⚙️ Anatomy of a Transformer (Slide 1–2)

Encoder Side (Understanding Input)

Input: "I love llamas"

Each word is embedded and passed into multiple encoder layers, each with:

  • 🔍 Self-Attention – each word attends to all other words in the sentence
  • ⚙️ Feedforward Network – learns complex representations

This creates contextualized embeddings: "love" knows it relates to "I" and "llamas."
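A rough PyTorch sketch of that stack (the dimensions and inputs below are arbitrary stand-ins, not real word embeddings):

```python
import torch
import torch.nn as nn

d_model = 16                         # embedding size (tiny, for illustration)
tokens = torch.randn(1, 3, d_model)  # batch of 1 sentence, 3 tokens: "I", "love", "llamas"

# One encoder layer = multi-head self-attention + a feedforward network (plus residuals and layer norm).
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)  # stack a couple of layers

contextual = encoder(tokens)
print(contextual.shape)  # torch.Size([1, 3, 16]): one contextualized vector per input token
```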

Decoder Side (Generating Output)

The decoder takes:

  • Previously generated words → e.g. "Ik", "hou" (Dutch for "I love")
  • Plus encoded context (from the encoder)

And generates the next word: "van."

The decoder includes:

  • 🕶️ Masked Self-Attention (explained next)
  • 🔄 Encoder-Decoder Attention (uses encoder output)
  • ⚙️ Feedforward NN
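Here is a minimal PyTorch sketch of one decoder step, assuming the encoder output is already available (all tensors are random placeholders):

```python
import torch
import torch.nn as nn

d_model = 16
encoder_output = torch.randn(1, 3, d_model)  # encoded "I love llamas" (the encoder "memory")
generated = torch.randn(1, 2, d_model)       # embeddings of the words generated so far ("Ik", "hou")

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)

# Causal (additive) mask: -inf above the diagonal, so position i only attends to positions <= i.
T = generated.size(1)
tgt_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

# Masked self-attention over the generated words, then attention over the encoder output.
out = decoder_layer(generated, encoder_output, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([1, 2, 16]); the last position is used to predict the next word ("van")
```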

🕶️ What Is Masked Self-Attention? (Slide 2)

You don't want a model to "cheat" by looking at future words during generation.

Masked self-attention prevents the model from peeking ahead. When generating the third word, it only sees the first two:

| Generating Word | Can Attend To |
|---|---|
| 1st | [Start token] |
| 2nd | [Start token, 1st] |
| 3rd | [Start token, 1st, 2nd] |

It's how autoregressive models like GPT maintain left-to-right generation.
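The mask itself is just an upper-triangular matrix that blocks attention to later positions; a small PyTorch sketch:

```python
import torch

seq_len = 4
# True marks positions that must NOT be attended to: everything above the diagonal is "the future".
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(mask)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
# Row i = the token being generated; masked (True) columns are the future tokens it may not see.
```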


🧱 Encoder-Only vs Decoder-Only vs Encoder-Decoder (Slide 3–6)​

🟒 Encoder-Only β†’ BERT (Representation Model)​

  • Task: Understand text (classification, QA, embeddings)
  • Uses bidirectional self-attention (sees left + right context)
  • Trained with masked language modeling (Slide 4):
    • Randomly mask a word
    • Predict the missing word (e.g., "I [MASK] llamas" → "am")
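To see masked language modeling in action, here is a quick sketch using the Hugging Face transformers library (this assumes transformers and a PyTorch backend are installed; bert-base-uncased is just one example checkpoint):

```python
# pip install transformers torch
from transformers import pipeline

# The fill-mask pipeline runs masked language modeling with a pretrained encoder-only model.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("I [MASK] llamas."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Prints BERT's top candidates for the masked word, with their probabilities.
```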

Fine-tuned for downstream tasks (Slide 5):

  • Sentiment classification
  • Named entity recognition
  • Sentence similarity

🔴 Decoder-Only → GPT (Generative Model)

  • Task: Generate text (completion, stories, chat)
  • Uses masked self-attention only (left-to-right)
  • No encoder at all
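A minimal sketch of left-to-right generation, again assuming the Hugging Face transformers library (gpt2 is just an example checkpoint):

```python
# pip install transformers torch
from transformers import pipeline

# A decoder-only model extends the prompt one token at a time, left to right.
generator = pipeline("text-generation", model="gpt2")

result = generator("I love llamas because", max_new_tokens=20)
print(result[0]["generated_text"])
```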

🔷 Encoder-Decoder → T5, BART, Translation models

  • Input goes through the encoder
  • Decoder generates outputs using attention over the encoder
  • Used for: translation, summarization, question answering
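And a sketch of the encoder-decoder setup for translation, assuming the Hugging Face transformers library (t5-small is an illustrative checkpoint whose pretraining mixture included English-to-German translation):

```python
# pip install transformers torch sentencepiece
from transformers import pipeline

# The encoder reads the English sentence; the decoder generates German while attending to it.
translator = pipeline("translation_en_to_de", model="t5-small")

print(translator("I love llamas.")[0]["translation_text"])
```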

🀹 Summary Table​

| Model Type | Uses Encoder | Uses Decoder | Attention Type | Example Models |
|---|---|---|---|---|
| Encoder-Only | ✅ Yes | ❌ No | Bidirectional Self-Attention | BERT, RoBERTa |
| Decoder-Only | ❌ No | ✅ Yes | Masked Self-Attention | GPT, GPT-2/3/4 |
| Encoder-Decoder | ✅ Yes | ✅ Yes | Self-Attn + Encoder-Dec Attn | T5, BART, MT5 |

🔥 Why Transformers Win

  • ✅ Parallelizable (not step-by-step like RNNs)
  • ✅ Scales well with data and compute
  • ✅ Captures global context with attention
  • ✅ Foundation of LLMs like GPT, BERT, Claude, Gemini, etc.