Retrieval-Augmented Generation (RAG)
RAG, or Retrieval-Augmented Generation, is an AI framework that enhances the accuracy and relevance of large language model (LLM) responses by integrating information retrieval from external knowledge sources before generating text.
- LLMs are combined with knowledge bases or document stores.
- Instead of cramming everything into the prompt, relevant context is retrieved in real-time and injected into the prompt dynamically.
I initially had the impression that RAG might simply be appending data to a prompt string, or doing a find-and-replace (and that is indeed part of the process). What makes retrieval sophisticated is the use of vector search and similarity checks.
A REST API that fetches user data from a database and inserts it into a prompt that is then passed to an LLM can be considered a basic or simplistic form of RAG. But it lacks the sophistication of semantic search and relies on exact-match, structured queries.
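The "simplistic RAG" above can be sketched in a few lines. This is illustrative only: the dict stands in for a database table, and `build_prompt` is a hypothetical helper, not a real API.

```python
# Stand-in for a database table of user records.
USERS = {42: {"name": "Ada", "plan": "pro"}}

def build_prompt(user_id: int, question: str) -> str:
    # Exact-match structured lookup -- no semantic search involved.
    user = USERS[user_id]
    return (
        f"User profile: name={user['name']}, plan={user['plan']}.\n"
        f"Question: {question}"
    )

prompt = build_prompt(42, "What features do I have access to?")
```

The retrieval here is a key lookup; the rest of this note is about replacing that lookup with semantic search.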
Full RAG is a hybrid approach that:
- retrieves relevant documents or data using vector search (semantic similarity)
- augments the prompt sent to the LLM with those documents
- generates a final answer that's grounded in context
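"Semantic similarity" in the retrieval step usually means comparing embedding vectors, most often with cosine similarity. A minimal sketch, using toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]
doc_a = [0.8, 0.2, 0.1]   # points in a similar direction -> high score
doc_b = [0.0, 0.1, 0.9]   # points elsewhere -> low score
```

A vector database is essentially an index that makes this comparison fast across millions of stored vectors.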
RAG Pipeline (Simplified)
- Preprocess Data
- Split documents into chunks (e.g., 500 words)
- Generate embeddings for each chunk
- Store embeddings in a vector database
- At Query Time
- Embed the user’s question
- Use vector search to retrieve top-k similar chunks
- Inject those into the prompt:
  Based on the following documents: [doc1], [doc2], ...
  Answer: "How do I register for sales tax in Quebec?"
- Send to LLM and get answer
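The pipeline above can be sketched end to end. To keep it self-contained, a bag-of-words `Counter` stands in for a real embedding model and a plain list stands in for a vector database; in practice you would use an embedding model and a vector store.

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in for an embedding model: a bag-of-words vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def similarity(a, b):
    # Cosine similarity over sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Preprocess: chunk documents and embed each chunk into an "index".
chunks = [
    "Register for sales tax in Quebec through Revenu Quebec.",
    "Income tax returns are filed annually in April.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query time: embed the question, retrieve top-k chunks, build the prompt.
def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(index, key=lambda item: similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

question = "How do I register for sales tax in Quebec?"
context = retrieve(question)
prompt = f"Based on the following documents: {context}\nAnswer: \"{question}\""
```

The final `prompt` is what gets sent to the LLM; the model's answer is then grounded in the retrieved chunks rather than its training data alone.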
✅ Why RAG is Useful
| Problem LLMs Have | How RAG Helps |
|---|---|
| Hallucinations | Grounds answers in real facts |
| Limited context window | Retrieves only relevant info |
| No access to custom data | Injects private/company data |
| Outdated model knowledge | Real-time retrieval from fresh sources |