📘 Knowledge Item: Retrieval-Augmented Generation (RAG) – Synthesized Summary

Source: Galileo – Mastering RAG (NotebookLM excerpts)

Category: AI Engineering, Applied LLMs

Level: Intermediate–Advanced


🔍 What is RAG?

Retrieval-Augmented Generation (RAG) is a method for enhancing a Large Language Model’s output by supplying it with external, relevant information at inference time.

It sits at the intersection of search and generation, combining retrieval models with LLMs to improve factual accuracy, reduce hallucinations, and enable dynamic knowledge injection.


🚨 Why RAG?

| Problem | How RAG Helps |
|---|---|
| Hallucinations | Provides factual grounding from external sources |
| Data Gaps | Bridges access to private, proprietary, or up-to-date data |
| Limited Context Windows | Retrieves only the most relevant chunks |
| High Cost of Fine-tuning | Avoids re-training by retrieving new info dynamically |
| Rapidly Changing Environments | Continuously queries live knowledge sources |

🧱 RAG System Architecture

🔁 Core Flow: Retrieve → Read (Generate)

Advanced production systems may use a full pipeline:

Rewrite → Retrieve → Rerank → Refine → Insert → Generate (+ Verify)


🔬 Pipeline Breakdown

1. Data Preparation

  • Chunking: Break large texts into manageable parts based on structure
  • Embedding: Encode chunks using models like Sentence Transformers, BGE, or OpenAI
  • Vector DB: Store chunks in vector stores (e.g. Pinecone, Weaviate)
  • Metadata: Attach tags (e.g. topic, author) for filtered search
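
A minimal sketch of this stage, assuming the sentence-transformers package and a plain in-memory NumPy array as a stand-in for a hosted vector store; the model name, chunk size, and overlap are illustrative choices, not prescriptions from the source:

```python
# Data-preparation sketch: chunk documents, embed the chunks, and keep an
# in-memory index. A production system would write to a vector DB such as
# Pinecone or Weaviate instead of holding a NumPy array.
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking with overlap (illustrative sizes)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")   # any embedding model works here

documents = {"doc-1": "full text of the first document goes here"}
chunks, metadata = [], []
for doc_id, text in documents.items():
    for i, piece in enumerate(chunk(text)):
        chunks.append(piece)
        metadata.append({"doc_id": doc_id, "chunk_id": i})  # tags for filtered search

# Normalized embeddings so a dot product equals cosine similarity.
index = model.encode(chunks, normalize_embeddings=True)      # shape: (n_chunks, dim)
```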

2. Query Rewrite

  • Rephrase unclear or verbose queries
  • Inject conversational history
  • Use Chain-of-Thought or ReAct-style prompting
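
One way to implement the rewrite step; `call_llm` is a hypothetical helper standing in for whatever chat-completion client is in use, and the prompt wording is only illustrative:

```python
# Query-rewrite sketch. `call_llm` is a hypothetical helper standing in for
# whatever chat/completions client is in use; the prompt wording is illustrative.
def rewrite_query(query: str, history: list[str], call_llm) -> str:
    history_block = "\n".join(history)
    prompt = (
        "Rewrite the user's latest question as a single, self-contained search query. "
        "Resolve pronouns and vague references using the conversation history.\n\n"
        f"History:\n{history_block}\n\n"
        f"Question: {query}\n"
        "Rewritten query:"
    )
    return call_llm(prompt).strip()

# e.g. "what about pricing?" -> "What is the product's pricing?"
```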

3. Retrieval

  • Dense (embedding-based): Semantic similarity via vector search (e.g. HNSW)
  • Sparse (keyword-based): Traditional IR (e.g. BM25)
  • Hybrid: Combine both for broader recall
  • Generative: LLM generates document IDs directly (practical only at small corpus scale)
  • GraphRAG: Query over knowledge graphs extracted from text
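
A sketch of the hybrid option, fusing dense and sparse rankings with Reciprocal Rank Fusion, one common fusion scheme but not the only choice. It reuses `model`, `index`, and `chunks` from the data-preparation sketch and assumes the rank_bm25 package:

```python
# Hybrid retrieval sketch: dense cosine scores fused with BM25 rankings via
# Reciprocal Rank Fusion (RRF).
import numpy as np
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi([c.lower().split() for c in chunks])   # sparse (keyword) side

def hybrid_retrieve(query: str, k: int = 5, rrf_k: int = 60) -> list[int]:
    dense_scores = index @ model.encode([query], normalize_embeddings=True)[0]
    sparse_scores = bm25.get_scores(query.lower().split())
    fused: dict[int, float] = {}
    for ranking in (np.argsort(-dense_scores), np.argsort(-sparse_scores)):
        for rank, idx in enumerate(ranking):
            fused[int(idx)] = fused.get(int(idx), 0.0) + 1.0 / (rrf_k + rank + 1)
    # Highest fused score first; returns indices into `chunks`.
    return sorted(fused, key=fused.get, reverse=True)[:k]
```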

4. Rerank

  • Use cross-encoders or LLMs as rerankers
  • Measure query-document relevance more deeply
  • Reduce hallucination and boost in-domain quality
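
A reranking sketch with a public cross-encoder from sentence-transformers; the model name is an example, not a recommendation from the source:

```python
# Reranking sketch: cross-encoders score each (query, passage) pair jointly,
# which is slower than bi-encoders but captures relevance more precisely.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    scores = reranker.predict([(query, passage) for passage in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]
```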

5. Refine

  • Summarize retrieved text (extractive or abstractive)
  • Use CoN (Chain-of-Note) to integrate multi-document relevance
  • Shrink for context window limits
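
A refine sketch under a crude character budget (real systems count tokens); `call_llm` is the same hypothetical helper as in the query-rewrite sketch:

```python
# Refine sketch: keep retrieved text verbatim if it fits, otherwise fall back
# to query-focused abstractive summarization.
def refine(query: str, passages: list[str], call_llm, budget: int = 4000) -> str:
    joined = "\n\n".join(passages)
    if len(joined) <= budget:
        return joined                          # already fits; keep verbatim
    prompt = (                                 # abstractive fallback
        "Summarize only the parts of the passages below that are relevant to "
        f"this question: {query}\n\n{joined[:3 * budget]}"   # hard cap on input size
    )
    return call_llm(prompt)
```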

6. Insert

  • Strategically place retrieved data into LLM prompt
  • Maintain coherence, relevance, and logical flow
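
A sketch of the insertion step: number each passage so the model can cite it, then wrap everything in an instruction template (the template wording is illustrative):

```python
# Insertion sketch: build the final prompt from the query and refined passages.
def build_prompt(query: str, passages: list[str]) -> str:
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the sources below. Cite sources as "
        "[1], [2], ... and say so explicitly if the answer is not in them.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```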

7. Generate

  • LLM produces final output
  • May use interleaved generation: e.g. FLARE (forward-looking active retrieval)
  • Can include citations for each fact or section
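
A generation sketch that reuses `build_prompt` and the hypothetical `call_llm` from the earlier sketches, then maps citation tags in the answer back to the passages they refer to:

```python
# Generation sketch: produce the answer and keep only the sources it cites.
import re

def generate_with_citations(query: str, passages: list[str], call_llm):
    answer = call_llm(build_prompt(query, passages))
    cited = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    sources = {i: passages[i - 1] for i in cited if 0 < i <= len(passages)}
    return answer, sources   # answer text plus the passages it actually cited
```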

8. Verify (Optional but Critical)

  • Rule-based, symbolic, or LLM-based checks
  • “LLM-as-a-judge” paradigm gaining traction
  • Ensures factuality, coherence, and safety before output
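
A minimal LLM-as-a-judge check, again using the hypothetical `call_llm` helper; production verifiers are usually claim-by-claim rather than a single pass:

```python
# Verification sketch: ask a judge model whether the sources support the answer.
def verify(answer: str, passages: list[str], call_llm) -> bool:
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    verdict = call_llm(
        "Do the sources fully support every factual claim in the answer? "
        "Reply with exactly SUPPORTED or UNSUPPORTED.\n\n"
        f"Sources:\n{sources}\n\nAnswer:\n{answer}"
    )
    return verdict.strip().upper().startswith("SUPPORTED")
```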

💡 Use Cases Beyond Retrieval

| Scenario | Description |
|---|---|
| Knowledge Injection | Main use: reduce hallucinations by grounding generation |
| Memory Management | Simulate long-term memory by swapping relevant history in and out of the prompt |
| Dynamic Few-Shot Learning | Retrieve the best in-context examples on the fly (see the sketch below) |
| Tool Selection | Help agents decide which external tools/APIs to call |
| REALM-style Training | Integrate retrieval into model pretraining or finetuning |
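
A sketch of the dynamic few-shot case: pick the labeled examples most similar to the incoming query and splice them into the prompt. It reuses `model` from the data-preparation sketch; the tiny example store and labels are made up for illustration:

```python
# Dynamic few-shot sketch: retrieve the most similar labeled examples on the fly.
import numpy as np

examples = [("the refund took three weeks", "complaint"),
            ("love the new dashboard", "praise")]
example_vecs = model.encode([text for text, _ in examples], normalize_embeddings=True)

def few_shot_prompt(query: str, k: int = 2) -> str:
    q = model.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(-(example_vecs @ q))[:k]      # most similar examples first
    shots = "\n".join(f"Text: {examples[i][0]}\nLabel: {examples[i][1]}" for i in top)
    return f"{shots}\nText: {query}\nLabel:"
```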

⚔️ Comparison: RAG vs Fine-Tuning vs Prompt Engineering

| Feature | Prompt Engineering | Fine-Tuning | RAG |
|---|---|---|---|
| Changes Weights | ❌ | ✅ | ❌ |
| New Facts Injection | ❌ | ⚠️ (weak unless repeated) | ✅ |
| Task-Specific Behavior | ⚠️ | ✅ | ⚠️ |
| Cost/Infra Overhead | Low | High | Medium |
| Latency | Low | Low | Higher (retrieval adds time) |

Best Practice: Combine fine-tuning (for tone, format) with RAG (for facts, updates).


⚠️ Limitations of RAG

| Limitation | Details |
|---|---|
| Surface-Level Reasoning | Snippets don’t guarantee deep understanding |
| Retrieval Bottlenecks | Garbage in, garbage out: retriever quality matters |
| Context Constraints | Still bound by the LLM’s input limits |
| Latency Overhead | Real-time retrieval adds computational cost |
| Ground Truth Dependence | RAG only works if high-quality data is available |

🧠 TL;DR Mental Model

RAG = LLM + External Memory + Smart Filtering

It doesn’t make the LLM smarter—it makes the LLM better grounded.