Retrieval-Augmented Generation (RAG)
RAG, or Retrieval-Augmented Generation, is an AI framework that enhances the accuracy and relevance of large language model (LLM) responses by integrating information retrieval from external knowledge sources before generating text.
- LLMs are combined with knowledge bases or document stores.
- Instead of cramming everything into the prompt, relevant context is retrieved in real-time and injected into the prompt dynamically.
I initially had the impression that RAG might simply be appending data to a prompt string, or doing a find-and-replace (and that is indeed part of the process). What makes retrieval sophisticated is the use of vector search and similarity checks.
A REST API that fetches user data from a database and inserts it into a prompt that is then passed to an LLM can be considered a basic or simplistic form of RAG. But it lacks the sophistication of semantic search and relies on exact-match, structured queries.
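The "simplistic RAG" above can be sketched in a few lines. This is illustrative only: the dict stands in for a database table, and `build_prompt` is a hypothetical helper, not a real API.

```python
# Stand-in for a database table of user records.
USERS = {42: {"name": "Ada", "plan": "pro"}}

def build_prompt(user_id: int, question: str) -> str:
    # Exact-match structured lookup -- no semantic search involved.
    user = USERS[user_id]
    return (
        f"User profile: name={user['name']}, plan={user['plan']}.\n"
        f"Question: {question}"
    )

prompt = build_prompt(42, "What features do I have access to?")
```

The retrieval here is a key lookup; the rest of this note is about replacing that lookup with semantic search.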
Full RAG is a hybrid approach that:
- retrieves relevant documents or data using vector search (semantic similarity)
- augments the prompt sent to the LLM with those documents
- generates a final answer that's grounded in context
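"Semantic similarity" in the retrieval step usually means comparing embedding vectors, most often with cosine similarity. A minimal sketch, using toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]
doc_a = [0.8, 0.2, 0.1]   # points in a similar direction -> high score
doc_b = [0.0, 0.1, 0.9]   # points elsewhere -> low score
```

A vector database is essentially an index that makes this comparison fast across millions of stored vectors.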
RAG Pipeline (Simplified)
- Preprocess Data
- Split documents into chunks (e.g., 500 words)
- Generate embeddings for each chunk
- Store embeddings in a vector database
- At Query Time
- Embed the user’s question
- Use vector search to retrieve top-k similar chunks
- Inject those into the prompt:
  Based on the following documents: [doc1], [doc2], ...
  Answer: "How do I register for sales tax in Quebec?"
- Send to LLM and get answer
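The pipeline above can be sketched end to end. To keep it self-contained, a bag-of-words `Counter` stands in for a real embedding model and a plain list stands in for a vector database; in practice you would use an embedding model and a vector store.

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in for an embedding model: a bag-of-words vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def similarity(a, b):
    # Cosine similarity over sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Preprocess: chunk documents and embed each chunk into an "index".
chunks = [
    "Register for sales tax in Quebec through Revenu Quebec.",
    "Income tax returns are filed annually in April.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query time: embed the question, retrieve top-k chunks, build the prompt.
def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(index, key=lambda item: similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

question = "How do I register for sales tax in Quebec?"
context = retrieve(question)
prompt = f"Based on the following documents: {context}\nAnswer: \"{question}\""
```

The final `prompt` is what gets sent to the LLM; the model's answer is then grounded in the retrieved chunks rather than its training data alone.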
✅ Why RAG is Useful
| Problem LLMs Have | How RAG Helps |
|---|---|
| Hallucinations | Grounds answers in real facts |
| Limited context window | Retrieves only relevant info |
| No access to custom data | Injects private/company data |
| Outdated model knowledge | Real-time retrieval from fresh sources |