Sổ Tay AI
Technique · Intermediate

What is RAG (Retrieval-Augmented Generation)?

A technique that lets an LLM look up your documents before answering — reducing hallucinations and grounding answers in real data.

Updated: May 2, 2026 · 1 min read

RAG (Retrieval-Augmented Generation) is a technique that lets an LLM consult a corpus of documents before answering a question.

How it works

  1. You have a corpus (PDFs, web pages, database records…)
  2. The system turns each chunk into an embedding (a number vector)
  3. When a user asks something, their question is also embedded
  4. Find the chunks whose vectors are closest to the question
  5. Stuff those chunks into the prompt → the LLM answers grounded in them
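The steps above can be sketched in a few lines. This is a minimal, self-contained illustration: it uses a toy bag-of-words "embedding" and cosine similarity in place of a real embedding model and vector database, and the corpus chunks are invented for the example.

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: word counts. Real systems use a neural
    # embedding model that maps text to a dense vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-2: embed each corpus chunk once, up front.
chunks = [
    "The 6-month deposit rate is 4.8% per year.",
    "Branch opening hours are 8am to 5pm on weekdays.",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(question, k=1):
    # Steps 3-4: embed the question, find the closest chunks.
    q = embed(question)
    ranked = sorted(index, key=lambda p: cosine(q, p[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Step 5: stuff the retrieved chunks into the prompt.
question = "What is the 6-month deposit rate?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In production the index lives in a vector database and retrieval returns the top-k chunks, but the shape of the pipeline is the same: embed once, search at query time, prepend the hits to the prompt.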

Why use RAG?

  • LLMs only know what was in their training data — RAG lets them use fresh, private, or proprietary information
  • LLMs hallucinate when they don’t know something — RAG forces them to lean on a source
  • You can’t fit terabytes of documents in a prompt — RAG only retrieves the relevant pieces

Example

A bank wants a customer-support chatbot. Instead of fine-tuning an LLM (expensive, slow), they use RAG: when a customer asks “What’s the 6-month deposit rate?”, the system pulls the latest rate sheet, hands it to the LLM, and the LLM answers with accurate numbers.
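The grounding step in this scenario amounts to building a prompt around the retrieved document. A hypothetical sketch (the template wording and the rate-sheet text are illustrative, not from any real system):

```python
def build_prompt(question, retrieved_chunks):
    # Instruct the LLM to answer only from the retrieved context,
    # which is what keeps the numbers accurate.
    context = "\n".join(retrieved_chunks)
    return (
        "Answer the customer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What's the 6-month deposit rate?",
    ["Rate sheet (effective today): 6-month term deposit: 4.8%/year."],
)
```

The "if not in the context, say you don't know" clause is the anti-hallucination lever: without it, the model may fall back on stale training data when retrieval misses.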

When to use RAG

  • You have an internal corpus the model needs
  • Answers should reflect up-to-date data
  • You need to reduce hallucinations

When NOT to use RAG

  • The question doesn’t need outside knowledge (“write me a leave-of-absence email”)
  • Your corpus is small (<100 pages) — just put it in the context window
  • You want consistent persona/style — that’s a job for fine-tuning
Tags
#rag #llm #vector-db