Sổ Tay AI
Technique · Intermediate

What is RAG (Retrieval-Augmented Generation)?

A technique that lets an LLM look up your documents before answering — reducing hallucinations and grounding answers in real data.

Updated: May 2, 2026 · 1 min read

RAG (Retrieval-Augmented Generation) is a technique that lets an LLM consult a corpus of documents before answering a question.

How it works

  1. You have a corpus (PDFs, web pages, database records…)
  2. The system turns each chunk into an embedding (a number vector)
  3. When a user asks something, their question is also embedded
  4. Find the chunks whose vectors are closest to the question
  5. Stuff those chunks into the prompt → the LLM answers grounded in them
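The steps above can be sketched in a few lines. This is a minimal, self-contained illustration: it uses a toy bag-of-words "embedding" and cosine similarity in place of a real embedding model and vector database, and the corpus chunks are invented for the example.

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: word counts. Real systems use a neural
    # embedding model that maps text to a dense vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-2: embed each corpus chunk once, up front.
chunks = [
    "The 6-month deposit rate is 4.8% per year.",
    "Branch opening hours are 8am to 5pm on weekdays.",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(question, k=1):
    # Steps 3-4: embed the question, find the closest chunks.
    q = embed(question)
    ranked = sorted(index, key=lambda p: cosine(q, p[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Step 5: stuff the retrieved chunks into the prompt.
question = "What is the 6-month deposit rate?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In production the index lives in a vector database and retrieval returns the top-k chunks, but the shape of the pipeline is the same: embed once, search at query time, prepend the hits to the prompt.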

Why use RAG?

  • LLMs only know what was in their training data — RAG lets them use fresh, private, or proprietary information
  • LLMs hallucinate when they don’t know something — RAG forces them to lean on a source
  • You can’t fit terabytes of documents in a prompt — RAG only retrieves the relevant pieces

Example

A bank wants a customer-support chatbot. Instead of fine-tuning an LLM (expensive, slow), they use RAG: when a customer asks “What’s the 6-month deposit rate?”, the system pulls the latest rate sheet, hands it to the LLM, and the LLM answers with accurate numbers.
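The grounding step in this scenario amounts to building a prompt around the retrieved document. A hypothetical sketch (the template wording and the rate-sheet text are illustrative, not from any real system):

```python
def build_prompt(question, retrieved_chunks):
    # Instruct the LLM to answer only from the retrieved context,
    # which is what keeps the numbers accurate.
    context = "\n".join(retrieved_chunks)
    return (
        "Answer the customer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What's the 6-month deposit rate?",
    ["Rate sheet (effective today): 6-month term deposit: 4.8%/year."],
)
```

The "if not in the context, say you don't know" clause is the anti-hallucination lever: without it, the model may fall back on stale training data when retrieval misses.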

When to use RAG

  • You have an internal corpus the model needs
  • Answers should reflect up-to-date data
  • You need to reduce hallucinations

When NOT to use RAG

  • The question doesn’t need outside knowledge (“write me a leave-of-absence email”)
  • Your corpus is small (<100 pages) — just put it in the context window
  • You want consistent persona/style — that’s a job for fine-tuning
Tags
#rag #llm #vector-db