What Is Self-Attention?
Self-attention allows a model to look at all the other words in a sentence (or a document, or code...) and decide how important each of them is for understanding a particular word.
Imagine reading:
"The dog chased the llama because it was fast."
You need to ask:
π§ "Does it refer to the dog or the llama?"
Thatβs where self-attention comes in. It helps assign relevance scores between tokens so the model can understand these dependencies.
How It Works: Step-by-Step
1. Projection into Q, K, V (Query, Key, Value)
Every token in the input sequence is passed through three learned linear transformations:
- Query (Q): What am I looking for?
- Key (K): What does each word offer?
- Value (V): What content do I retrieve if a match is found?
Each token gets its own Q, K, and V vectors.
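As a minimal sketch of this step (the dimensions and weight names below are illustrative assumptions, and the weights are random here rather than learned), the projections are just three matrices applied to every token embedding:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k, d_v = 3, 8, 4, 4    # 3 tokens; illustrative dimensions

X = rng.normal(size=(n, d_model))    # token embeddings, one row per token

# Three learned linear transformations (random here, learned in a real model)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q = X @ W_Q   # what each token is looking for
K = X @ W_K   # what each token offers
V = X @ W_V   # the content each token carries

print(Q.shape, K.shape, V.shape)   # (3, 4) (3, 4) (3, 4)
```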
2. Attention Scores via Dot Product
To compute how much attention one word pays to others:
- Compute dot product between Q of the current word and K of all words.
- Scale the result by the square root of the key dimension, $\sqrt{d_k}$.
- Apply softmax to turn scores into weights.
- Multiply weights with V to get the final contextual embedding.
Full Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

- $Q \in \mathbb{R}^{n \times d_k}$
- $K \in \mathbb{R}^{n \times d_k}$
- $V \in \mathbb{R}^{n \times d_v}$
- $n$ = sequence length, $d_k$ = key/query dimension, $d_v$ = value dimension
The result is a matrix of contextualized embeddings: one for each word, now aware of its neighbors.
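Putting the steps and the formula together, here is a small NumPy sketch of scaled dot-product attention. The function names and shapes are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q, K of shape (n, d_k) and V of shape (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) similarity matrix
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (n, d_v) contextualized embeddings

# Tiny usage example with random Q, K, V
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(attention(Q, K, V).shape)   # (3, 4): one contextualized vector per token
```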
Why Divide by √d_k?
Without scaling, the dot products grow in magnitude with the key dimension $d_k$, which pushes the softmax toward a near one-hot (overly peaky) distribution where gradients become vanishingly small.
Dividing by $\sqrt{d_k}$ keeps the scores in a moderate range, which stabilizes gradients and makes learning easier.
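A quick numeric illustration of the effect (the score values are made up for demonstration): scaling the same scores by $\sqrt{d_k}$ keeps the softmax from collapsing to a near one-hot distribution.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

d_k = 64
scores = np.array([8.0, 16.0, 24.0])      # unscaled dot products (illustrative)

print(softmax(scores))                    # ~[0.0000, 0.0003, 0.9997]  (near one-hot)
print(softmax(scores / np.sqrt(d_k)))     # ~[0.09, 0.24, 0.67]        (smoother weights)
```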
Simple Conceptual Example
For a 3-token input:
- Tokens: "I", "love", "cats"
- Each token gets Q, K, V vectors (e.g., [0.1, 0.3])
- Compute Q × Kᵀ → similarity scores
- Apply softmax → attention weights (e.g., [0.2, 0.3, 0.5])
- Multiply by V → weighted sum → the contextual embedding for each token
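The same toy example in code, as a self-contained sketch; the 2-dimensional Q, K, V values are made up purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy Q, K, V for the three tokens "I", "love", "cats" (values invented)
Q = np.array([[0.1, 0.3], [0.4, 0.2], [0.3, 0.5]])
K = np.array([[0.2, 0.1], [0.5, 0.4], [0.1, 0.6]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

scores  = Q @ K.T / np.sqrt(Q.shape[-1])  # 3x3 similarity scores
weights = softmax(scores, axis=-1)        # each row: attention over "I", "love", "cats"
context = weights @ V                     # weighted sum of values -> contextual embeddings

print(np.round(weights, 2))
print(np.round(context, 2))
```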
Multi-Head Attention
Instead of just one set of Q/K/V projections, the model uses multiple "heads":
- Each head learns different relationships (e.g., grammar, meaning)
- The outputs of all heads are concatenated and linearly transformed
This lets the model capture different types of context simultaneously.
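Here is a hedged sketch of the multi-head wiring. The head count, dimensions, and weight names are illustrative assumptions, and real implementations typically fuse the per-head projections into single matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d_model, n_heads = 3, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(n, d_model))   # token embeddings

heads = []
for _ in range(n_heads):
    # Each head has its own Q/K/V projections and can learn different relationships
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

# Concatenate the head outputs and apply a final linear transformation
W_O = rng.normal(size=(n_heads * d_head, d_model))
output = np.concatenate(heads, axis=-1) @ W_O
print(output.shape)   # (3, 8)
```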
Why Self-Attention Matters
- Understands global relationships, not just nearby words
- Fully parallelizable (unlike RNNs)
- Easily scaled up (transformers can handle very long sequences)
- Enables contextual embeddings: words are represented in context
Positional Encoding
Because attention alone is order-agnostic, we inject position info:
- Sinusoidal positional encoding (original Transformer paper)
- Learnable positional embeddings (BERT, GPT)
These are added to the token embeddings before attention so the model knows word order.
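A minimal sketch of the sinusoidal variant, assuming an even model dimension; the function name, sequence length, and dimensions are illustrative:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings before attention so the model sees word order
token_embeddings = np.zeros((3, 8))   # placeholder embeddings for 3 tokens
inputs = token_embeddings + sinusoidal_positional_encoding(3, 8)
print(inputs.shape)   # (3, 8)
```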
Mini Practicum: Predicting "hate" from "not like"
Input: "not like"
Model should predict: "hate"
Steps:
- Embed tokens → [x_not, x_like]
- Compute Q = x × W_Q, K = x × W_K, V = x × W_V
- Get attention_weights = softmax(Q × Kᵀ / √d_k)
- Compute H' = attention_weights × V
- Final output: H = H' × W_o
- Pass through a linear classifier over the vocabulary → predict "hate"
Summary
Self-attention lets a model:
- Compare all tokens with all others via dot product
- Learn long-range dependencies
- Replace sequential memory with direct contextual lookup
- Power the core of the Transformer architecture