Encoders, Decoders, and Attention
What Are Encoders and Decoders?
Encoders and decoders are the core building blocks in many modern neural network architectures, especially for tasks involving sequences, such as translation, summarization, or text generation.
Think of them as a two-part system:
- The encoder reads and understands the input.
- The decoder generates the output based on that understanding.
They first appeared together in Sequence-to-Sequence (Seq2Seq) models, and are foundational to transformers, BERT, GPT, T5, and more.
What Does an Encoder Do?
The encoder takes the input sequence and compresses it into a meaningful representation (often called a "context vector" or "embedding").
Example:
Input: "The cat sat on the mat."
The encoder:
- Tokenizes the sentence
- Embeds each token (word → vector)
- Processes the vectors using layers (like RNNs or transformers)
- Outputs a sequence of context-rich vectors that summarize the input
In transformers, the encoder outputs a vector per input token, each infused with contextual relationships, as in the sketch below.
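As a concrete illustration, here is a minimal sketch of those four steps using the Hugging Face transformers library. The choice of the bert-base-uncased checkpoint is just an assumption for the example; any pretrained encoder works the same way.

```python
# Minimal sketch: run a sentence through a pretrained encoder.
# Assumes `transformers` and `torch` are installed; "bert-base-uncased" is one example checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# 1. Tokenize the sentence into subword IDs.
inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")

# 2-3. Embed each token and pass the vectors through the encoder's transformer layers.
with torch.no_grad():
    outputs = encoder(**inputs)

# 4. One context-rich vector per input token.
token_vectors = outputs.last_hidden_state  # shape: (1, num_tokens, 768)
print(token_vectors.shape)
```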
What Does a Decoder Do?
The decoder generates the output one token at a time, using the encoded input and previously generated tokens.
Example:
Task: Translate "The cat sat on the mat." to French
The decoder starts with:
- [<sos>] (start-of-sentence token)
- Predicts "Le"
- Feeds "Le" back in, predicts "chat"
- Continues until it predicts an end-of-sequence token
It's autoregressive: each output token depends on the previously generated tokens plus the encoded input (see the sketch below).
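To make the loop concrete, here is a hedged sketch of greedy autoregressive decoding with an off-the-shelf English-to-French model; the Helsinki-NLP/opus-mt-en-fr checkpoint is just one example choice.

```python
# Sketch of the autoregressive loop: feed each predicted token back into the decoder.
# Assumes `transformers`, `sentencepiece`, and `torch`; the checkpoint is one example choice.
import torch
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

encoder_inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
decoded = torch.tensor([[model.config.decoder_start_token_id]])  # the <sos>-style start token

for _ in range(30):
    with torch.no_grad():
        logits = model(**encoder_inputs, decoder_input_ids=decoded).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy: most likely next token
    decoded = torch.cat([decoded, next_id], dim=-1)          # feed it back in
    if next_id.item() == tokenizer.eos_token_id:             # stop at end-of-sequence
        break

print(tokenizer.decode(decoded[0], skip_special_tokens=True))
```

In practice you would usually call `model.generate(**encoder_inputs)`, which wraps this same loop (plus extras like beam search).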
Why Is Attention So Important?
Attention solves a key problem:
Not all words in a sentence are equally important.
Traditional models (like LSTMs) squish everything into one final vector, which loses detail. Attention lets the model focus on different parts of the input when generating each output token.
Example:
To translate "The cat sat on the mat," the French word for "sat" depends most on "cat" and "sat" itself, not on "the."
Attention lets the decoder dynamically attend to the relevant parts of the input.
How Attention Works (Simplified)
Each word in the input gets turned into three vectors:
- Query (Q): What am I looking for?
- Key (K): What do I have?
- Value (V): What can I use if there's a match?
The model compares the query to all the keys (via dot products) to get attention scores, then uses those scores to weight the values.
Output = weighted sum of values (with most attention paid to the relevant inputs)
This is the magic that lets transformers understand complex relationships, like subject/verb links or nested clauses, in a single step.
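Here is a small, self-contained sketch of that computation (scaled dot-product attention) in plain NumPy; the random Q, K, V matrices stand in for the learned projections of real token embeddings.

```python
# Scaled dot-product attention: weights = softmax(Q K^T / sqrt(d_k)), output = weights @ V.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # compare every query with every key
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V, weights                     # weighted sum of values

# Toy example: 6 tokens, 8-dimensional Q/K/V (stand-ins for learned projections).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 8)) for _ in range(3))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))           # (6, 8), and each row of weights sums to 1
```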
Encoder-Decoder Architectures in Practice
Classic Seq2Seq (RNN-based)
- Encoder = LSTM/GRU processes input
- Decoder = LSTM/GRU generates output
- Problem: the whole input gets squeezed into the final encoder state, creating a bottleneck (see the sketch below)
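A stripped-down PyTorch sketch of this design, just to show where the bottleneck lives (the GRU choice and layer sizes are arbitrary assumptions):

```python
# Minimal RNN encoder-decoder: the entire input ends up in one hidden state.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.embed(src_ids))              # whole input -> one state (the bottleneck)
        dec_hidden, _ = self.decoder(self.embed(tgt_ids), state)  # decoder only ever sees that one vector
        return self.out(dec_hidden)                               # logits over the target vocabulary

model = Seq2Seq(vocab_size=1000)
logits = model(torch.randint(0, 1000, (1, 7)), torch.randint(0, 1000, (1, 5)))
print(logits.shape)  # (1, 5, 1000)
```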
Transformer (like T5, BART)
- Encoder: multiple layers of self-attention
- Decoder: layers with self-attention + encoder-decoder attention
- More parallelizable and typically more accurate than RNN-based Seq2Seq (see the T5 example below)
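For comparison with the manual decoding loop earlier, here is the same translation task through an encoder-decoder transformer using the high-level generate API; "t5-small" is just a convenient small example checkpoint.

```python
# Encoder-decoder transformer in practice: T5 with the high-level generate() API.
# Assumes `transformers`, `sentencepiece`, and `torch`; "t5-small" is one example checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 frames every task as text-to-text, so the task description is part of the prompt.
input_ids = tokenizer("translate English to French: The cat sat on the mat.",
                      return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=20)  # runs the autoregressive decoder for us
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```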
GPT (Decoder-only Transformer)
- No encoder, only a decoder with masked (causal) self-attention (sketched below)
- Great for generation tasks (chat, stories, code)
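The only structural change from the attention sketch earlier is a causal mask, which blanks out future positions before the softmax. A rough NumPy illustration:

```python
# Masked (causal) self-attention: token i may only attend to tokens 0..i.
import numpy as np

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
scores = np.where(future, -np.inf, scores)                      # future tokens get zero weight

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # lower-triangular: no token "sees" the future
```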
BERT (Encoder-only Transformer)
- Only the encoder stack
- Great for understanding tasks (classification, QA, embeddings), as in the example below
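A quick illustration of an encoder-only model doing an "understanding" task: the pipeline helper below downloads a default sentiment checkpoint (an encoder-only model with a classification head on top) the first time it runs, which is an assumption about your environment rather than anything specific to BERT itself.

```python
# Encoder-only model on an "understanding" task: sentence classification.
# Assumes `transformers` and `torch`; the pipeline pulls a default sentiment checkpoint.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("The cat sat on the mat and looked very pleased about it."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```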
Summary
| Component | Role | Used In |
|---|---|---|
| Encoder | Encodes input into context vectors | BERT, T5, translation models |
| Decoder | Autoregressively generates output | GPT, T5, BART |
| Attention | Lets the model focus on relevant input parts | All modern transformers |
Why This Matters
Understanding encoders, decoders, and attention unlocks your ability to:
- Read and build transformer architectures
- Understand how models like GPT or BERT work under the hood
- Design your own LLM workflows (e.g. RAG, fine-tuning, prompt chaining)