📘 Knowledge Item: Retrieval-Augmented Generation (RAG) – Synthesized Summary
Source: Galileo – Mastering RAG (NotebookLM excerpts)
Category: AI Engineering, Applied LLMs
Level: Intermediate–Advanced
🔍 What is RAG?
Retrieval-Augmented Generation (RAG) is a method for enhancing a Large Language Model’s output by supplying it with external, relevant information at inference time.
It sits at the intersection of search and generation, combining retrieval models with LLMs to improve factual accuracy, reduce hallucinations, and enable dynamic knowledge injection.
🚨 Why RAG?
Problem | How RAG Helps |
---|---|
Hallucinations | Provides factual grounding from external sources |
Data Gaps | Bridges access to private, proprietary, or up-to-date data |
Limited Context Windows | Retrieves only the most relevant chunks |
High Cost of Fine-tuning | Avoids re-training by retrieving new info dynamically |
Rapidly Changing Environments | Continuously queries live knowledge sources |
🧱 RAG System Architecture
🔁 Core Flow: Retrieve → Read (Generate)
Advanced production systems may use a full pipeline:
Rewrite → Retrieve → Rerank → Refine → Insert → Generate (+ Verify)
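Sketched as code, the advanced pipeline is just a chain of stages. The skeleton below is a hypothetical illustration (none of these helper names come from the source); each stage gets a concrete sketch in the breakdown that follows.

```python
# Hypothetical skeleton of the advanced pipeline; every helper here is a
# placeholder that the per-stage sketches below make concrete.
def rag_answer(question: str, history: list[str]) -> str:
    query = rewrite_query(question, history)    # Rewrite
    candidates = retrieve(query)                # Retrieve (dense/sparse/hybrid)
    top_docs = rerank(query, candidates)        # Rerank
    context = refine(query, top_docs)           # Refine (compress to budget)
    prompt = build_prompt(question, context)    # Insert
    answer = generate(prompt)                   # Generate
    if not verify(answer, context):             # Verify (optional)
        answer = generate(prompt + "\nStick strictly to the sources.")
    return answer
```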
🔬 Pipeline Breakdown
1. Data Preparation
- Chunking: Break large texts into manageable parts, following document structure (headings, paragraphs) or fixed token windows
- Embedding: Encode chunks with models such as Sentence Transformers, BGE, or OpenAI's embedding models (see the sketch after this list)
- Vector DB: Store chunks in vector stores (e.g. Pinecone, Weaviate)
- Metadata: Attach tags (e.g. topic, author) for filtered search
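A minimal sketch of the chunk-and-embed step, assuming sentence-transformers is installed. The model name, window sizes, and the in-memory stand-in for a vector DB are illustrative choices, not recommendations from the source.

```python
# Fixed-size chunking with overlap, then embedding.
from sentence_transformers import SentenceTransformer
import numpy as np

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

docs = ["Retrieval-Augmented Generation supplies an LLM with external context."]
chunks = [c for d in docs for c in chunk(d)]
vectors = model.encode(chunks, normalize_embeddings=True)  # (n_chunks, dim)

# Toy in-memory stand-in for a vector DB (Pinecone, Weaviate, ...),
# with per-chunk metadata for filtered search:
index = {"chunks": chunks, "vectors": np.asarray(vectors),
         "metadata": [{"topic": "rag"} for _ in chunks]}
```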
2. Query Rewrite
- Rephrase unclear or verbose queries
- Inject conversational history
- Use Chain-of-Thought or ReAct-style prompting (a minimal rewrite is sketched after this list)
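A minimal rewrite sketch. `complete` stands in for whatever chat/completions call your stack uses (it is hypothetical here), and the prompt wording is illustrative.

```python
# Query rewriting: resolve pronouns and fold in history so the retriever
# sees a self-contained query. `complete` is a hypothetical LLM callable.
def rewrite_query(question: str, history: list[str], complete) -> str:
    prompt = (
        "Rewrite the user's question as a single, self-contained search query.\n"
        "Resolve pronouns using the conversation history.\n\n"
        "History:\n" + "\n".join(history) +
        f"\n\nQuestion: {question}\nRewritten query:"
    )
    return complete(prompt).strip()

# e.g. history about Anthropic + "When did they found it?"
# should come back as something like "When was Anthropic founded?"
```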
3. Retrieval
- Dense (embedding-based): Semantic similarity via approximate nearest-neighbor vector search (e.g. HNSW)
- Sparse (keyword-based): Traditional IR (e.g. BM25)
- Hybrid: Combine both for broader recall (a minimal blend is sketched after this list)
- Generative: LLM generates document identifiers directly; practical only at small corpus scale
- GraphRAG: Query over knowledge graphs extracted from text
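A minimal hybrid sketch blending BM25 with dense cosine scores, assuming rank-bm25 and sentence-transformers plus the normalized vectors from the data-prep sketch. The 50/50 weight is arbitrary and would normally be tuned on a validation set.

```python
# Hybrid retrieval: min-max-normalize sparse and dense scores, then blend.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def hybrid_search(query: str, chunks: list[str], vectors: np.ndarray,
                  model: SentenceTransformer, alpha: float = 0.5, k: int = 5):
    # Sparse: BM25 over whitespace tokens.
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    sparse = np.asarray(bm25.get_scores(query.lower().split()))
    # Dense: cosine similarity (vectors are pre-normalized).
    q = model.encode([query], normalize_embeddings=True)[0]
    dense = vectors @ q

    def norm(s):  # put both score scales on [0, 1] before blending
        return (s - s.min()) / (s.max() - s.min() + 1e-9)

    scores = alpha * norm(sparse) + (1 - alpha) * norm(dense)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```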
4. Rerank
- Use cross-encoders or LLMs as rerankers
- Score query-document relevance more precisely than embedding similarity alone (see the cross-encoder sketch after this list)
- Reduce hallucination and boost in-domain quality
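A minimal reranking sketch with a cross-encoder, which scores each (query, document) pair jointly instead of comparing precomputed embeddings. The model name is an illustrative choice.

```python
# Cross-encoder reranking over a short candidate list.
from sentence_transformers import CrossEncoder

def rerank(query: str, docs: list[str], k: int = 3) -> list[str]:
    ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = ce.predict([(query, d) for d in docs])  # one score per pair
    ranked = sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)
    return [d for _, d in ranked[:k]]
```

Cross-encoders are slower per pair than bi-encoders, which is why they are applied only to the short candidate list coming out of retrieval rather than to the whole corpus.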
5. Refine
- Summarize retrieved text (extractive or abstractive)
- Use CoN (Chain-of-Note) to integrate multi-document relevance
- Shrink content to fit context-window limits (a budgeted selection is sketched after this list)
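A simple extractive stand-in for the refine step: keep the most relevant chunks until a character budget is exhausted. A production system might summarize abstractively or use Chain-of-Note instead; `query_vec` and `vectors` are the normalized embeddings from the earlier sketches.

```python
# Budgeted extractive refinement: most relevant chunks first, hard size cap.
import numpy as np

def refine(query_vec: np.ndarray, chunks: list[str], vectors: np.ndarray,
           budget_chars: int = 2000) -> str:
    scores = vectors @ query_vec            # cosine, given normalized vectors
    kept, used = [], 0
    for i in np.argsort(scores)[::-1]:      # most relevant first
        if used + len(chunks[i]) > budget_chars:
            break
        kept.append(chunks[i])
        used += len(chunks[i])
    return "\n\n".join(kept)
```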
6. Insert
- Strategically place the retrieved passages into the LLM prompt
- Maintain coherence, relevance, and logical flow (a prompt-assembly sketch follows this list)
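A minimal prompt-assembly sketch. Numbering the sources lets the generator cite them as [n]; the layout and instruction wording are illustrative.

```python
# Insert: number the sources for citation, then place the question last.
def build_prompt(question: str, passages: list[str]) -> str:
    sources = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the sources below. Cite each claim as [n]. "
        "If the sources are insufficient, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
```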
7. Generate
- LLM produces final output
- May use interleaved generation, e.g. FLARE (forward-looking active retrieval); a loose sketch follows this list
- Can include citations for each fact or section
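A loose sketch of the forward-looking idea behind FLARE, not the published algorithm: draft the next sentence, retrieve on that draft, then commit. Real FLARE triggers retrieval from low token probabilities; `complete` and `retrieve` are hypothetical callables, with `retrieve` assumed to return a context string.

```python
# FLARE-style interleaved generation (loose sketch): each drafted sentence
# becomes the next retrieval query before being appended to the answer.
def interleaved_generate(question: str, complete, retrieve,
                         max_steps: int = 8) -> str:
    answer, context = "", retrieve(question)
    for _ in range(max_steps):
        draft = complete(
            f"Context:\n{context}\n\nQ: {question}\nAnswer so far: {answer}\n"
            "Write the next sentence, or END if finished:"
        )
        if draft.strip() == "END":
            break
        context = retrieve(draft)   # forward-looking: search on the draft
        answer += " " + draft.strip()
    return answer.strip()
```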
8. Verify (Optional but Critical)
- Rule-based, symbolic, or LLM-based checks
- The “LLM-as-a-judge” paradigm is gaining traction (a minimal judge is sketched after this list)
- Ensures factuality, coherence, and safety before output
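A minimal LLM-as-a-judge sketch. `complete` is again a hypothetical callable, and the rubric wording is illustrative; production verifiers often pair a judge like this with rule-based or symbolic checks.

```python
# Verify: ask a (possibly different) model whether every claim in the
# answer is supported by the retrieved sources.
def verify(answer: str, sources: str, complete) -> bool:
    verdict = complete(
        "You are a strict fact-checker. Sources:\n" + sources +
        "\n\nAnswer to check:\n" + answer +
        "\n\nReply SUPPORTED if every claim is backed by the sources, "
        "otherwise UNSUPPORTED:"
    )
    return verdict.strip().upper().startswith("SUPPORTED")
```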
💡 Use Cases Beyond Retrieval
Scenario | Description |
---|---|
Knowledge Injection | Main use: reduce hallucinations by grounding generation |
Memory Management | Simulate long-term memory by swapping relevant history |
Dynamic Few-Shot Learning | Retrieve best in-context examples on the fly |
Tool Selection | Help agents decide what external tools/APIs to call |
REALM-style Training | Integrate retrieval into model pretraining or finetuning |
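For the dynamic few-shot row above, a minimal sketch: embed a pool of labeled examples once, then pull the nearest ones into the prompt per query. The pool, `k`, and the reuse of normalized vectors from the data-prep sketch are all illustrative assumptions.

```python
# Dynamic few-shot selection by embedding similarity.
import numpy as np

def pick_examples(query_vec: np.ndarray, example_vecs: np.ndarray,
                  examples: list[str], k: int = 3) -> list[str]:
    sims = example_vecs @ query_vec         # cosine, given normalized vectors
    return [examples[i] for i in np.argsort(sims)[::-1][:k]]

# Usage: few = pick_examples(model.encode([q], normalize_embeddings=True)[0],
#                            example_vecs, examples)
```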
⚔️ Comparison: RAG vs Fine-Tuning vs Prompt Engineering
Feature | Prompt Engineering | Fine-Tuning | RAG |
---|---|---|---|
Changes Weights | ❌ | ✅ | ❌ |
New Facts Injection | ❌ | ⚠️ (weak unless repeated) | ✅ |
Task-Specific Behavior | ⚠️ | ✅ | ⚠️ |
Cost/Infra Overhead | Low | High | Medium |
Latency | Low | Low | Higher (retrieval adds time) |
→ Best Practice: Combine fine-tuning (for tone, format) with RAG (for facts, updates).
⚠️ Limitations of RAG
Limitation | Details |
---|---|
Surface-Level Reasoning | Snippets don’t guarantee deep understanding |
Retrieval Bottlenecks | Garbage in, garbage out: retriever quality caps overall answer quality |
Context Constraints | Still bound by LLM’s input limits |
Latency Overhead | Real-time retrieval adds computational cost |
Ground Truth Dependence | RAG only works if high-quality data is available |
🧠 TL;DR Mental Model
RAG = LLM + External Memory + Smart Filtering
It doesn’t make the LLM smarter; it makes the LLM better grounded.