Techniques · Intermediate
What is RAG (Retrieval-Augmented Generation)?
A technique that lets an LLM look up your documents before answering — reducing hallucinations and grounding answers in real data.
Updated: May 2, 2026 · 1 min read
RAG (Retrieval-Augmented Generation) is a technique that lets an LLM consult a corpus of documents before answering a question.
How it works
- You have a corpus (PDFs, web pages, database records…)
- The system splits it into chunks and turns each chunk into an embedding (a vector of numbers)
- When a user asks something, their question is embedded the same way
- The system finds the chunks whose vectors are closest to the question's
- Those chunks are stuffed into the prompt → the LLM answers grounded in them
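The steps above can be sketched in a few lines. This is a toy illustration: the bag-of-words `embed` function stands in for a real embedding model (production systems use a trained encoder and a vector database), and the sample chunks are made up.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical corpus chunks, embedded once at ingestion time.
chunks = [
    "The 6-month deposit rate is 4.1% per annum.",
    "Branch opening hours are 9am to 5pm on weekdays.",
    "Wire transfers over $10,000 require extra verification.",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(question, k=1):
    # Embed the question, rank chunks by similarity, return the top k.
    qv = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

top = retrieve("What is the 6-month deposit rate?")[0]
# `top` is the rate-sheet chunk, ready to paste into the prompt.
```

A real pipeline differs only in scale: the embeddings come from a model, and the nearest-neighbor search runs in a vector index instead of a sorted list.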
Why use RAG?
- LLMs only know what was in their training data — RAG lets them use fresh, private, or proprietary information
- LLMs hallucinate when they don’t know something — RAG forces them to lean on a source
- You can’t fit terabytes of documents in a prompt — RAG only retrieves the relevant pieces
Example
A bank wants a customer-support chatbot. Instead of fine-tuning an LLM (expensive, slow), they use RAG: when a customer asks “What’s the 6-month deposit rate?”, the system pulls the latest rate sheet, hands it to the LLM, and the LLM answers with accurate numbers.
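The "hands it to the LLM" step is just prompt assembly. A minimal sketch, with a hypothetical retrieved chunk (the LLM call itself is provider-specific, so only the grounded prompt is shown):

```python
# Hypothetical retrieved chunk and customer question.
retrieved = "6-month term deposit: 4.10% p.a."
question = "What's the 6-month deposit rate?"

# Instruct the model to rely only on the supplied context — this is
# what "grounding" means in practice.
prompt = (
    "Answer the customer's question using ONLY the context below. "
    "If the context does not contain the answer, say so.\n\n"
    f"Context:\n{retrieved}\n\n"
    f"Question: {question}"
)
print(prompt)
```

The explicit "say so if the context doesn't contain the answer" instruction is what discourages the model from falling back on guesses.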
When to use RAG
- You have an internal corpus the model needs
- Answers should reflect up-to-date data
- You need to reduce hallucinations
When NOT to use RAG
- The question doesn’t need outside knowledge (“write me a leave-of-absence email”)
- Your corpus is small (<100 pages) — just put it in the context window
- You want consistent persona/style — that’s a job for fine-tuning
Tags
#rag #llm #vector-db