📘 Knowledge Item: Retrieval-Augmented Generation (RAG) – Synthesized Summary

Source: Galileo – Mastering RAG (NotebookLM excerpts)

Category: AI Engineering, Applied LLMs

Level: Intermediate–Advanced


🔍 What is RAG?

Retrieval-Augmented Generation (RAG) is a method for enhancing a Large Language Model’s output by supplying it with external, relevant information at inference time.

It sits at the intersection of search and generation, combining retrieval models with LLMs to improve factual accuracy, reduce hallucinations, and enable dynamic knowledge injection.


🚨 Why RAG?

| Problem | How RAG Helps |
|---|---|
| Hallucinations | Provides factual grounding from external sources |
| Data Gaps | Bridges access to private, proprietary, or up-to-date data |
| Limited Context Windows | Retrieves only the most relevant chunks |
| High Cost of Fine-tuning | Avoids re-training by retrieving new info dynamically |
| Rapidly Changing Environments | Continuously queries live knowledge sources |

🧱 RAG System Architecture

🔁 Core Flow: Retrieve → Read (Generate)

Advanced production systems may use a full pipeline:

Rewrite → Retrieve → Rerank → Refine → Insert → Generate (+ Verify)


🔬 Pipeline Breakdown

1. Data Preparation

  • Chunking: Break large texts into manageable parts based on structure
  • Embedding: Encode chunks using models like Sentence Transformers, BGE, or OpenAI
  • Vector DB: Store chunks in vector stores (e.g. Pinecone, Weaviate)
  • Metadata: Attach tags (e.g. topic, author) for filtered search
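
A minimal sketch of this stage, assuming the sentence-transformers package and a plain in-memory NumPy array as a stand-in for a hosted vector store; the model name, chunk size, and overlap are illustrative choices, not prescriptions from the source:

```python
# Data-preparation sketch: chunk documents, embed the chunks, and keep an
# in-memory index. A production system would write to a vector DB such as
# Pinecone or Weaviate instead of holding a NumPy array.
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking with overlap (illustrative sizes)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")   # any embedding model works here

documents = {"doc-1": "full text of the first document goes here"}
chunks, metadata = [], []
for doc_id, text in documents.items():
    for i, piece in enumerate(chunk(text)):
        chunks.append(piece)
        metadata.append({"doc_id": doc_id, "chunk_id": i})  # tags for filtered search

# Normalized embeddings so a dot product equals cosine similarity.
index = model.encode(chunks, normalize_embeddings=True)      # shape: (n_chunks, dim)
```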

2. Query Rewrite

  • Rephrase unclear or verbose queries
  • Inject conversational history
  • Use Chain-of-Thought or ReAct-style prompting
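
One way to implement the rewrite step; `call_llm` is a hypothetical helper standing in for whatever chat-completion client is in use, and the prompt wording is only illustrative:

```python
# Query-rewrite sketch. `call_llm` is a hypothetical helper standing in for
# whatever chat/completions client is in use; the prompt wording is illustrative.
def rewrite_query(query: str, history: list[str], call_llm) -> str:
    history_block = "\n".join(history)
    prompt = (
        "Rewrite the user's latest question as a single, self-contained search query. "
        "Resolve pronouns and vague references using the conversation history.\n\n"
        f"History:\n{history_block}\n\n"
        f"Question: {query}\n"
        "Rewritten query:"
    )
    return call_llm(prompt).strip()

# e.g. "what about pricing?" -> "What is the product's pricing?"
```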

3. Retrieval

  • Dense (embedding-based): Semantic similarity via vector search (e.g. HNSW)
  • Sparse (keyword-based): Traditional IR (e.g. BM25)
  • Hybrid: Combine both for broader recall
  • Generative: LLM generates document IDs directly (practical only at small corpus scale)
  • GraphRAG: Query over knowledge graphs extracted from text
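
A sketch of the hybrid option, fusing dense and sparse rankings with Reciprocal Rank Fusion, one common fusion scheme but not the only choice. It reuses `model`, `index`, and `chunks` from the data-preparation sketch and assumes the rank_bm25 package:

```python
# Hybrid retrieval sketch: dense cosine scores fused with BM25 rankings via
# Reciprocal Rank Fusion (RRF).
import numpy as np
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi([c.lower().split() for c in chunks])   # sparse (keyword) side

def hybrid_retrieve(query: str, k: int = 5, rrf_k: int = 60) -> list[int]:
    dense_scores = index @ model.encode([query], normalize_embeddings=True)[0]
    sparse_scores = bm25.get_scores(query.lower().split())
    fused: dict[int, float] = {}
    for ranking in (np.argsort(-dense_scores), np.argsort(-sparse_scores)):
        for rank, idx in enumerate(ranking):
            fused[int(idx)] = fused.get(int(idx), 0.0) + 1.0 / (rrf_k + rank + 1)
    # Highest fused score first; returns indices into `chunks`.
    return sorted(fused, key=fused.get, reverse=True)[:k]
```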

4. Rerank

  • Use cross-encoders or LLMs as rerankers
  • Measure query-document relevance more deeply
  • Reduce hallucination and boost in-domain quality
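
A reranking sketch with a public cross-encoder from sentence-transformers; the model name is an example, not a recommendation from the source:

```python
# Reranking sketch: cross-encoders score each (query, passage) pair jointly,
# which is slower than bi-encoders but captures relevance more precisely.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    scores = reranker.predict([(query, passage) for passage in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]
```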

5. Refine

  • Summarize retrieved text (extractive or abstractive)
  • Use CoN (Chain-of-Note) to integrate multi-document relevance
  • Shrink for context window limits
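
A refine sketch under a crude character budget (real systems count tokens); `call_llm` is the same hypothetical helper as in the query-rewrite sketch:

```python
# Refine sketch: keep retrieved text verbatim if it fits, otherwise fall back
# to query-focused abstractive summarization.
def refine(query: str, passages: list[str], call_llm, budget: int = 4000) -> str:
    joined = "\n\n".join(passages)
    if len(joined) <= budget:
        return joined                          # already fits; keep verbatim
    prompt = (                                 # abstractive fallback
        "Summarize only the parts of the passages below that are relevant to "
        f"this question: {query}\n\n{joined[:3 * budget]}"   # hard cap on input size
    )
    return call_llm(prompt)
```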

6. Insert

  • Strategically place retrieved data into LLM prompt
  • Maintain coherence, relevance, and logical flow
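
A sketch of the insertion step: number each passage so the model can cite it, then wrap everything in an instruction template (the template wording is illustrative):

```python
# Insertion sketch: build the final prompt from the query and refined passages.
def build_prompt(query: str, passages: list[str]) -> str:
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the sources below. Cite sources as "
        "[1], [2], ... and say so explicitly if the answer is not in them.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```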

7. Generate

  • LLM produces final output
  • May use interleaved generation: e.g. FLARE (forward-looking active retrieval)
  • Can include citations for each fact or section
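
A generation sketch that reuses `build_prompt` and the hypothetical `call_llm` from the earlier sketches, then maps citation tags in the answer back to the passages they refer to:

```python
# Generation sketch: produce the answer and keep only the sources it cites.
import re

def generate_with_citations(query: str, passages: list[str], call_llm):
    answer = call_llm(build_prompt(query, passages))
    cited = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    sources = {i: passages[i - 1] for i in cited if 0 < i <= len(passages)}
    return answer, sources   # answer text plus the passages it actually cited
```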

8. Verify (Optional but Critical)

  • Rule-based, symbolic, or LLM-based checks
  • “LLM-as-a-judge” paradigm gaining traction
  • Ensures factuality, coherence, and safety before output
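
A minimal LLM-as-a-judge check, again using the hypothetical `call_llm` helper; production verifiers are usually claim-by-claim rather than a single pass:

```python
# Verification sketch: ask a judge model whether the sources support the answer.
def verify(answer: str, passages: list[str], call_llm) -> bool:
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    verdict = call_llm(
        "Do the sources fully support every factual claim in the answer? "
        "Reply with exactly SUPPORTED or UNSUPPORTED.\n\n"
        f"Sources:\n{sources}\n\nAnswer:\n{answer}"
    )
    return verdict.strip().upper().startswith("SUPPORTED")
```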

💡 Use Cases Beyond Retrieval

| Scenario | Description |
|---|---|
| Knowledge Injection | Main use: reduce hallucinations by grounding generation |
| Memory Management | Simulate long-term memory by swapping relevant history in and out of the prompt |
| Dynamic Few-Shot Learning | Retrieve the best in-context examples on the fly (see the sketch below) |
| Tool Selection | Help agents decide which external tools/APIs to call |
| REALM-style Training | Integrate retrieval into model pretraining or finetuning |
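
A sketch of the dynamic few-shot case: pick the labeled examples most similar to the incoming query and splice them into the prompt. It reuses `model` from the data-preparation sketch; the tiny example store and labels are made up for illustration:

```python
# Dynamic few-shot sketch: retrieve the most similar labeled examples on the fly.
import numpy as np

examples = [("the refund took three weeks", "complaint"),
            ("love the new dashboard", "praise")]
example_vecs = model.encode([text for text, _ in examples], normalize_embeddings=True)

def few_shot_prompt(query: str, k: int = 2) -> str:
    q = model.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(-(example_vecs @ q))[:k]      # most similar examples first
    shots = "\n".join(f"Text: {examples[i][0]}\nLabel: {examples[i][1]}" for i in top)
    return f"{shots}\nText: {query}\nLabel:"
```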

⚔️ Comparison: RAG vs Fine-Tuning vs Prompt Engineering

| Feature | Prompt Engineering | Fine-Tuning | RAG |
|---|---|---|---|
| Changes Weights | ❌ | ✅ | ❌ |
| New Facts Injection | ❌ | ⚠️ (weak unless repeated) | ✅ |
| Task-Specific Behavior | ⚠️ | ✅ | ⚠️ |
| Cost/Infra Overhead | Low | High | Medium |
| Latency | Low | Low | Higher (retrieval adds time) |

Best Practice: Combine fine-tuning (for tone, format) with RAG (for facts, updates).


⚠️ Limitations of RAG

| Limitation | Details |
|---|---|
| Surface-Level Reasoning | Snippets don’t guarantee deep understanding |
| Retrieval Bottlenecks | Garbage in, garbage out: retriever quality matters |
| Context Constraints | Still bound by the LLM’s input limits |
| Latency Overhead | Real-time retrieval adds computational cost |
| Ground Truth Dependence | RAG only works if high-quality data is available |

🧠 TL;DR Mental Model

RAG = LLM + External Memory + Smart Filtering

It doesn’t make the LLM smarter—it makes the LLM better grounded.