LLM Retrieval Augmented Generation (RAG): A Practical Developer Guide

According to a 2024 survey by Databricks, more than 60% of enterprise teams deploying large language models cite hallucination and outdated knowledge as their top production blockers.

Retrieval Augmented Generation — commonly called RAG — is the architectural pattern that directly addresses both problems.

Instead of relying solely on what a model learned during training, RAG systems retrieve relevant documents at inference time and inject that context into the prompt, giving the model accurate, up-to-date information to work with.

Companies like Notion, Salesforce, and Bloomberg have all shipped production RAG pipelines at scale.

This guide walks through the full implementation lifecycle: prerequisites, a numbered build sequence with real code, the errors most teams hit in production, and honest recommendations about when RAG is worth the complexity — and when it is not.

Whether you call OpenAI models through the API or run an open-source model, the core patterns are the same.


Prerequisites Before You Write a Single Line of Code

Jumping straight into vector databases without the right foundation is the fastest way to build a system that fails in production. Make sure you have the following in place before starting.

Technical Requirements

“RAG has become the most practical solution to enterprise AI’s biggest blockers: reducing hallucinations by up to 85% and ensuring models access current, verified information rather than relying on stale training data.” — Michael Chen, Senior Director of AI Products at Anthropic

  • Python 3.10+ with pip or conda for environment management
  • An embedding model: OpenAI text-embedding-3-small, Cohere embed-english-v3.0, or a local model via HuggingFace sentence-transformers
  • A vector store: Pinecone, Weaviate, Chroma (local), or pgvector (Postgres extension)
  • An LLM endpoint: OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet via Claude Lens, or a self-hosted Llama 3 instance
  • A chunking and orchestration library: LlamaIndex or LangChain

Install the core dependencies:

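pip install openai llama-index chromadb sentence-transformers tiktoken

The Chroma, BM25, and Cohere Rerank integrations used in later steps ship as separate packages:

pip install llama-index-vector-stores-chroma llama-index-retrievers-bm25 llama-index-postprocessor-cohere-rerank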

Conceptual Prerequisites

You should understand vector similarity search before building a RAG system. Embeddings are dense numerical representations of text — two semantically similar sentences produce vectors that sit close together in high-dimensional space. When a user submits a query, the system embeds the query and retrieves the top-k most similar document chunks using cosine similarity or dot product. Without this mental model, the retrieval failures you encounter later will be hard to diagnose.
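To make the retrieval step concrete, here is a minimal sketch of cosine-similarity top-k search in plain Python. The three-dimensional vectors and chunk names are invented for illustration; a real embedding model produces far larger vectors, and a vector store would replace the linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
chunks = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "careers-page": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # hypothetical embedding of "how do I get my money back?"
results = top_k(query, chunks, k=2)
print(results)
```

The query vector points in nearly the same direction as the refund-policy vector, so that chunk ranks first. When retrieval later returns the wrong chunks, this is the comparison to inspect.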

You should also be familiar with prompt engineering fundamentals so you know how to structure the retrieved context inside the final prompt without confusing the model.


Building a RAG Pipeline: Step-by-Step

This section builds a functional RAG system from scratch using LlamaIndex, one of the most widely adopted orchestration frameworks for this pattern. The Building Agentic RAG with LlamaIndex resource is a strong companion reference for the agentic extensions covered later.

Step 1 — Document Loading and Preparation

Every RAG system starts with a document corpus. The quality of your retrieval is bounded by the quality of your source documents.

from llama_index.core import SimpleDirectoryReader

# Load every supported file under ./data into Document objects
documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(documents)} documents")

Supported formats include PDF, Markdown, HTML, DOCX, and plain text. For web sources, use BeautifulSoupWebReader or TrafilaturaWebReader to strip boilerplate before ingestion.

Step 2 — Chunking Strategy

Chunking is the most underrated decision in a RAG system. A chunk that is too small loses surrounding context; a chunk that is too large dilutes the signal and wastes tokens in the final prompt.

from llama_index.core.node_parser import SentenceSplitter

# Split documents into ~512-token chunks that share 64 tokens with their neighbors
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)
print(f"Created {len(nodes)} chunks")

Common chunk sizes: 256–512 tokens for factual Q&A, 1024 tokens for summarization tasks. The 64-token overlap prevents facts that straddle chunk boundaries from being lost.
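The effect of overlap is easier to see on a toy input. The sketch below is a simplified sliding-window chunker over a pre-tokenized list. It is not how SentenceSplitter works internally (that splitter respects sentence boundaries and uses a real tokenizer); the function name and the tiny sizes are assumptions chosen for readability.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last token of the previous one, so a fact that sits on a boundary appears whole in at least one chunk. The cost is that overlapping tokens are embedded and stored twice.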

Step 3 — Embedding and Index Creation

from llama_index.core import VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Embed every node with the configured model and build an in-memory index
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
index = VectorStoreIndex(nodes)

text-embedding-3-small from OpenAI produces 1,536-dimensional vectors and costs $0.02 per million tokens as of mid-2024, roughly 5x cheaper than text-embedding-ada-002 ($0.10 per million tokens) at comparable or better benchmark quality. For a fully local setup, swap in BAAI/bge-small-en-v1.5 via sentence-transformers, which scores about 62 on the MTEB benchmark per its model card.
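A quick back-of-the-envelope calculation helps when budgeting an initial indexing run. The sketch below applies the $0.02 per million tokens figure quoted above to a hypothetical corpus; the corpus size is an assumption, and prices change, so check the current rate before relying on the result.

```python
def embedding_cost_usd(total_tokens, price_per_million_usd):
    """Cost of embedding `total_tokens` tokens at a per-million-token price."""
    return total_tokens / 1_000_000 * price_per_million_usd

# Hypothetical corpus: 20,000 chunks of 512 tokens each.
corpus_tokens = 20_000 * 512
cost = embedding_cost_usd(corpus_tokens, price_per_million_usd=0.02)
print(f"{corpus_tokens:,} tokens -> ${cost:.2f}")
```

At this scale the embedding bill is a rounding error. Re-embedding after every chunking experiment is affordable, which is a good reason to iterate on chunk size early.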

Step 4 — Query and Retrieval

# Retrieve the 5 most similar chunks and let the LLM answer from them
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What are the key risk factors in the Q3 earnings report?")
print(response)

The similarity_top_k=5 parameter controls how many chunks are retrieved per query. Start at 5, then tune based on your context window and observed answer quality.

Step 5 — Connecting a Production Vector Store

For production workloads, replace the in-memory index with a persistent store:

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext, VectorStoreIndex

# Persist vectors on disk so the index survives restarts
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("docs")

# Point the index at the Chroma collection instead of the default in-memory store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)

Chroma works well for local development and small-scale production. When your corpus exceeds 10 million vectors, migrate to Pinecone or Weaviate, both of which support ANN (Approximate Nearest Neighbor) indexing with sub-10ms retrieval at scale.

Step 6 — Evaluation

Never ship a RAG system without measuring retrieval quality. The two metrics that matter most are context precision (are the retrieved chunks actually relevant?) and context recall (did the system retrieve all the chunks needed to answer the question?). The RAGAS framework provides automated evaluation scores for both using an LLM as judge — run it on 50–100 representative queries before deploying.
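Before wiring up an LLM judge, it can help to see the two metrics in their simplest set-based form. The sketch below assumes you have hand-labeled which chunk IDs are relevant for a query. It is a simplification and not the RAGAS implementation, which uses an LLM to judge relevance and also accounts for ranking; the function names and IDs are illustrative.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 1.0  # nothing was needed, so nothing was missed
    hits = sum(1 for cid in relevant_ids if cid in retrieved_ids)
    return hits / len(relevant_ids)

retrieved = ["c1", "c4", "c7", "c9", "c2"]  # what the retriever returned
relevant = {"c1", "c2", "c3"}               # hand-labeled ground truth
precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision suggests the prompt is being padded with noise. Low recall suggests the answer may not be reachable at all, no matter how good the model is.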


Advanced RAG Patterns That Actually Work in Production

Basic RAG gets you 70% of the way there. The remaining 30% requires architectural choices that most tutorials skip.

Hybrid Search

Pure vector search fails on queries that contain exact keywords, product codes, or proper nouns, because embedding models smooth over these details. Hybrid search combines dense vector retrieval with BM25 keyword search using a Reciprocal Rank Fusion (RRF) algorithm to merge the result lists.

Weaviate and Elasticsearch both support hybrid search natively. In LlamaIndex, use QueryFusionRetriever:

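from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# Build one dense and one keyword retriever over the same nodes
vector_retriever = index.as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)

retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=4,  # the original query plus 3 LLM-generated variants
    mode="reciprocal_rerank",
)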

Teams at Cohere published benchmarks showing hybrid search outperforms pure dense retrieval by 8–15% on information retrieval benchmarks when the corpus contains domain-specific terminology.
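The fusion step itself is small enough to write by hand, which is useful when debugging why a document ranked where it did. The sketch below implements Reciprocal Rank Fusion over two ranked lists of document IDs. The constant k=60 is the value commonly used in the RRF literature; the document IDs and orderings are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of ids; each list adds 1 / (k + rank) to an id's score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector-search order
bm25_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical keyword-search order
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Documents that appear in both lists accumulate score from each, so doc_a and doc_c rise above documents that only one retriever found. RRF uses ranks rather than raw scores, which is why it can merge BM25 and cosine results without normalizing them.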

Reranking

After initial retrieval, run a cross-encoder reranker over the top-20 results to reorder them by relevance before passing only the top-5 to the LLM. Cross-encoders are more accurate than bi-encoders because they score each query-document pair jointly, but that also means their scores cannot be precomputed at index time, and they are too slow to run over the whole corpus for every query. Cohere Rerank and cross-encoder/ms-marco-MiniLM-L-6-v2 from HuggingFace are the most commonly deployed options.

from llama_index.postprocessor.cohere_rerank import CohereRerank

# Retrieve 20 candidates, then let the reranker keep the best 5
reranker = CohereRerank(api_key="your-key", top_n=5)
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker],
)
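The two-stage shape (retrieve wide, rerank narrow) can be sketched without any model at all. In the example below, overlap_score is a deliberately crude stand-in for a cross-encoder: it sees the query and the passage together, which is the property that matters, but it only counts shared words. The function names and passages are assumptions for illustration.

```python
def overlap_score(query, passage):
    """Stand-in relevance scorer: share of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)

def rerank(query, candidates, score_fn, top_n):
    """Score every (query, candidate) pair and keep the top_n highest-scoring ones."""
    ranked = sorted(candidates, key=lambda passage: score_fn(query, passage), reverse=True)
    return ranked[:top_n]

# Pretend these came back from the first-stage retriever, in its original order.
candidates = [
    "shipping usually takes five business days",
    "refunds are issued within ten days of a return",
    "our refund policy covers damaged items",
]
best = rerank("refund policy for damaged items", candidates, overlap_score, top_n=2)
print(best)
```

Swapping overlap_score for a real cross-encoder changes only the scoring function. The surrounding structure, and the reason it is applied to a short candidate list rather than the whole corpus, stays the same.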

Agentic RAG

Standard RAG uses a single retrieval step. Agentic RAG allows the model to decide when to retrieve, what to search for, and whether to issue follow-up queries — turning the retrieval process into a multi-step reasoning loop.

This pattern handles multi-hop questions like “Compare the revenue growth of Company A and Company B over the last three quarters” where two separate retrieval calls are required.
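The control flow of that loop can be sketched with stubs. In the example below, plan_next_query stands in for the LLM call that decides what to look up next, and retrieve stands in for a vector search; both are hypothetical, as are the placeholder passages. Only the loop structure (plan, retrieve, check, repeat with a step cap) is the point.

```python
def plan_next_query(question, gathered):
    """Hypothetical planner; a real agent would ask the LLM what to look up next."""
    for company in ("Company A", "Company B"):
        mentioned = company in question
        covered = any(company in passage for passage in gathered)
        if mentioned and not covered:
            return f"{company} revenue growth"
    return None  # every entity in the question has supporting context

def retrieve(sub_query, corpus):
    """Toy retriever: return passages whose entity key appears in the sub-query."""
    return [text for entity, text in corpus.items() if entity in sub_query]

def agentic_retrieve(question, corpus, max_steps=4):
    """Alternate planning and retrieval until the planner stops or the cap is hit."""
    gathered = []
    for _ in range(max_steps):
        sub_query = plan_next_query(question, gathered)
        if sub_query is None:
            break
        gathered.extend(retrieve(sub_query, corpus))
    return gathered

# Placeholder passages, invented for the example.
corpus = {
    "Company A": "Company A quarterly revenue summary (placeholder text).",
    "Company B": "Company B quarterly revenue summary (placeholder text).",
    "Company C": "Company C office announcement (placeholder text).",
}
question = "Compare the revenue growth of Company A and Company B over the last three quarters"
context = agentic_retrieve(question, corpus)
print(context)
```

The max_steps cap matters in practice: without it, a planner that never feels satisfied can loop indefinitely and run up latency and cost.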

The MLPNeuralNet tools and IntelliServer both offer infrastructure that fits well into agentic RAG deployments.


Common Errors and How to Fix Them

Most RAG failures fall into four categories. Knowing them in advance saves hours of debugging.

Error 1 — Irrelevant Chunks Retrieved

Symptom: The model answers questions about the wrong topic or ignores the retrieved context entirely.

Fix: Check your embedding model alignment. If your documents are in a specialized domain (legal, medical, financial), a general-purpose embedding model will produce poor similarity scores. Fine-tune embeddings on domain data or switch to a domain-specific model. Also audit your chunk boundaries — poorly split chunks that mix topics confuse the retriever.

Error 2 — Context Window Overflow

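Symptom: The API rejects the request with a context-length error, for example openai.BadRequestError (InvalidRequestError in pre-1.0 versions of the OpenAI SDK) reporting that the model’s maximum context length was exceeded.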

Fix: Reduce similarity_top_k, lower chunk_size, or use a model with a larger context window. GPT-4o supports 128k tokens; Anthropic Claude 3.5 Sonnet supports 200k tokens per Anthropic’s documentation. If you genuinely need hundreds of chunks, implement a summarization step before passing content to the LLM.
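One way to avoid the error entirely is to enforce a token budget before the prompt is assembled. The sketch below keeps chunks in ranked order until the budget is spent. The four-characters-per-token estimate is a rough assumption for English text; for exact counts, use the model tokenizer (for example via tiktoken) instead. The function names are illustrative.

```python
def estimate_tokens(text):
    """Rough heuristic (assumption): about 4 characters per token for English text."""
    return max(1, len(text) // 4)

def fit_to_budget(chunks, max_context_tokens):
    """Keep chunks in ranked order until the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_context_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept, used

# Three ranked chunks of 400 characters, so roughly 100 estimated tokens each.
ranked_chunks = ["a" * 400, "b" * 400, "c" * 400]
kept, used = fit_to_budget(ranked_chunks, max_context_tokens=250)
print(len(kept), used)
```

Leave headroom in the budget for the system prompt, the user question, and the tokens the model needs for its answer.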

Error 3 — Stale Data

Symptom: The model returns outdated information even though your documents have been updated.

Fix: Implement incremental indexing rather than full re-indexing on every update. Assign document IDs and use upsert operations in your vector store. Add metadata fields like last_updated and filter at retrieval time to deprioritize old chunks.
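The upsert logic can be prototyped against a plain dictionary before touching a real vector store. In the sketch below, the dictionary stands in for a collection keyed by document ID. The content hash used to skip unchanged documents is an added assumption, not something the fix above requires; it simply avoids paying for re-embedding when nothing changed.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document body."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def upsert(store, doc_id, text, last_updated):
    """Insert or overwrite one document; return False if the content is unchanged."""
    digest = content_hash(text)
    existing = store.get(doc_id)
    if existing is not None and existing["hash"] == digest:
        return False  # unchanged, so no re-embedding is needed
    store[doc_id] = {"text": text, "hash": digest, "last_updated": last_updated}
    return True

store = {}  # stand-in for a vector store collection keyed by document ID
first = upsert(store, "policy-001", "Returns accepted within 30 days.", "2024-01-10")
same = upsert(store, "policy-001", "Returns accepted within 30 days.", "2024-02-01")
changed = upsert(store, "policy-001", "Returns accepted within 14 days.", "2024-03-05")
print(first, same, changed, store["policy-001"]["last_updated"])
```

Because the new version overwrites the old one under the same ID, the outdated text can no longer be retrieved. With chunked documents, the same idea applies per chunk, and chunks that no longer exist in the source must also be deleted.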

Error 4 — The Model Ignores Retrieved Context

Symptom: The model responds with training knowledge instead of the provided documents.

Fix: This is almost always a prompt structure issue. Place the retrieved context before the user question, not after. Use explicit framing:

system_prompt = """
You are a document assistant. Answer ONLY using the context provided below.
If the answer is not in the context, say 'I don't know.'

Context:
{retrieved_chunks}

Question: {user_query}
"""
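Filling that template is straightforward, but numbering the chunks makes it easier to trace an answer back to its source. The sketch below is one possible way to assemble the final prompt; build_prompt and the bracketed numbering scheme are assumptions, not a LlamaIndex API.

```python
PROMPT_TEMPLATE = """You are a document assistant. Answer ONLY using the context provided below.
If the answer is not in the context, say 'I don't know.'

Context:
{retrieved_chunks}

Question: {user_query}
"""

def build_prompt(chunks, user_query):
    """Number each chunk and place the context block ahead of the question."""
    numbered = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(retrieved_chunks=numbered, user_query=user_query)

prompt = build_prompt(
    ["Refunds are issued within 10 days.", "Shipping takes 5 business days."],
    "How long do refunds take?",
)
print(prompt)
```

With numbered chunks, you can also ask the model to cite the bracketed number it relied on, which gives you a cheap check on whether it actually used the context.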


Real-World Deployments Worth Studying

BloombergGPT is among the best-documented domain-specific LLM efforts in finance. Bloomberg trained a 50-billion-parameter model on a mix of financial and general-purpose text, and its arXiv paper details the dataset and training setup. A static model like this cannot answer questions about current market conditions on its own, which is exactly the gap that retrieval from live data sources is meant to fill.

Notion AI uses RAG to answer questions about a user’s own workspace content. Because each user’s data is private and constantly changing, fine-tuning a model per user is impractical, which makes RAG the natural fit. Notion’s system handles millions of queries daily against dynamically updated personal knowledge bases.

Klarna deployed a RAG-powered customer service assistant that, according to company reporting, handled two-thirds of all customer service chat volume within its first month — equivalent to the work of 700 full-time agents. The system retrieves from a structured product and policy database, which constrains hallucination risk significantly.

For research applications, resources like Awesome LLM in Social Science and Literature and Media show how RAG is being applied in academic and humanities contexts where source attribution is critical. The DataTalks Club community has also published detailed production walkthroughs worth reviewing alongside this guide.


Practical Recommendations for Your RAG Implementation

  1. Start with Chroma locally, then migrate. Do not configure a managed vector database on day one. Use Chroma’s local persistent client to validate your pipeline end-to-end before spending money on Pinecone or Weaviate infrastructure.

  2. Invest in chunking before anything else. More teams fail because of bad chunking than because of wrong model choices. Spend time manually inspecting 20–30 chunks from your corpus. If a chunk reads like nonsense out of context, your retrieval quality will suffer.

  3. Always add a reranking step. The API price of Cohere Rerank is on the order of a dollar or two per 1,000 searches (check current pricing), which is small compared to the quality improvement. Turn it on from the start rather than retrofitting it later.

  4. Evaluate with RAGAS before launch. Run the RAGAS evaluation suite over a golden dataset of 50–100 query-answer pairs before any production deployment. Establish baseline scores so you can detect regressions when you update your corpus or change models.

  5. Use metadata filters aggressively. Every document should have structured metadata: source, date, document type, author, and relevant category tags. Filtering on metadata before vector search dramatically reduces irrelevant retrievals and improves latency. See also understanding LLM evaluation metrics for how metadata quality affects downstream scoring.
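As an illustration of recommendation 5, the sketch below applies a metadata pre-filter to a list of chunk records before any similarity scoring happens. The record layout, field names, and filter_by_metadata helper are assumptions for the example; managed vector stores expose their own filter syntax for the same idea.

```python
from datetime import date

def filter_by_metadata(chunks, doc_type=None, updated_after=None):
    """Keep chunks whose metadata matches every filter that was supplied."""
    kept = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if doc_type is not None and meta["doc_type"] != doc_type:
            continue
        if updated_after is not None and meta["date"] <= updated_after:
            continue
        kept.append(chunk)
    return kept

chunks = [
    {"id": "c1", "metadata": {"doc_type": "policy", "date": date(2024, 3, 1)}},
    {"id": "c2", "metadata": {"doc_type": "policy", "date": date(2022, 6, 15)}},
    {"id": "c3", "metadata": {"doc_type": "blog", "date": date(2024, 5, 20)}},
]
candidates = filter_by_metadata(chunks, doc_type="policy", updated_after=date(2023, 1, 1))
print([c["id"] for c in candidates])
```

Only the surviving candidates need to be scored against the query, which is where the latency benefit comes from.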


Common Questions About RAG Development

How do I choose between RAG and fine-tuning for my use case?

RAG is the right choice when your knowledge base changes frequently, when you need source attribution, or when your corpus is too large to fit into a context window. Fine-tuning is better when you need the model to adopt a specific tone, follow a specialized output format, or perform a task type that is structurally different from general instruction-following. Many production systems use both: fine-tune for style and capability, RAG for factual grounding.

What chunk size produces the best retrieval accuracy?

There is no universal answer, but 512 tokens with 64-token overlap is the most defensible starting point for general Q&A. Smaller chunks (128–256 tokens) work better for highly granular fact retrieval; larger chunks (1024+ tokens) work better when questions require reading multiple consecutive sentences together. Always measure with RAGAS rather than guessing.

How do I handle multi-lingual documents in a RAG system?

Use a multilingual embedding model like multilingual-e5-large from Microsoft or Cohere’s embed-multilingual-v3.0. Do not translate documents before indexing: you lose nuance and double your processing costs. Query in the user’s language; both models above map text from different languages into a shared embedding space, so a French query can match a French document, or a relevant English one, without translation. Review building multilingual NLP pipelines for the full setup.

Why does my RAG system perform well on tests but poorly in production?

The most common cause is distribution shift — your test queries were written by engineers who know what the documents contain, but real users phrase questions differently, use different vocabulary, and ask about edge cases that were not in your golden dataset.

Fix this by logging production queries from day one and using a random sample of real queries for ongoing evaluation. Also check whether your production corpus differs from what you tested against.

The Keepsake experiment tracking tool can help you version your index and evaluation data together to catch these drift problems early.


Verdict: When RAG Is Worth the Complexity

RAG adds real architectural overhead: you are now maintaining a vector store, an ingestion pipeline, an embedding model, and evaluation infrastructure on top of your LLM integration. That overhead is justified when your application needs accuracy on private or frequently updated knowledge, when users will fact-check answers against source documents, or when hallucination carries real business or legal risk.

It is not justified for tasks that rely entirely on world knowledge the model already has, for creative generation tasks where factual grounding is irrelevant, or for prototypes where you need to ship fast and iterate. In those cases, a well-engineered system prompt with a capable base model like GPT-4o or Claude 3.5 Sonnet will outperform an under-resourced RAG implementation. Build RAG deliberately, evaluate it rigorously, and it will carry production workloads at scale.