Updated Jun 15, 2025

RAG & Agentic RAG

Retrieval-augmented generation and its evolution into agentic systems with hierarchical retrieval.

What is Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG), introduced by Lewis et al. (2020), is a technique that enhances LLM responses by retrieving relevant information from external knowledge bases and injecting it into the model's context before generation. Instead of relying solely on the model's training data (which may be outdated or incomplete), RAG gives the model access to current, specific, and authoritative information.

The motivation is simple: LLMs have broad knowledge but lack specific, up-to-date information about your company's products, your internal documentation, or yesterday's news. RAG bridges this gap by connecting the model to your data.

A basic RAG pipeline has three stages:

  1. Indexing — Your documents are split into chunks, converted to vector embeddings, and stored in a vector database. This is a one-time (or periodic) offline process.
  2. Retrieval — When a user asks a question, the query is embedded and used to find the most semantically similar document chunks in the vector store.
  3. Generation — The retrieved chunks are injected into the LLM's prompt alongside the user's question. The model generates a response grounded in the retrieved context.
[Figure] RAG Pipeline: Query → Embed → Retrieve → Augment → Generate
Fig. The five-step RAG pipeline. Steps ①–⑤ run at query time; the vector DB is pre-populated offline.
# Basic RAG pipeline
def rag_query(question, vector_store, llm):
    # Step 1: Retrieve relevant context
    query_embedding = embed(question)
    relevant_chunks = vector_store.similarity_search(
        query_embedding, top_k=5
    )

    # Step 2: Build augmented prompt
    context = "\n\n".join([chunk.text for chunk in relevant_chunks])
    prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {question}
Answer:"""

    # Step 3: Generate response
    return llm.generate(prompt)

Chunking Strategies

How you split documents into chunks is one of the most impactful decisions in RAG system design. Poor chunking leads to poor retrieval, which leads to irrelevant or incomplete answers. Here are the main strategies:

Fixed-Size Chunking

Split documents into chunks of a fixed token count (e.g., 512 tokens) with optional overlap. Simple to implement but often splits mid-sentence or mid-paragraph, losing context. Best used as a baseline.

Semantic Chunking

Split at natural boundaries — paragraphs, sections, headers. This preserves the logical structure of the document. More effective than fixed-size but requires parsing the document structure.

Recursive Chunking

Start by trying to split on the largest boundaries (sections), then fall back to smaller boundaries (paragraphs, sentences) if the resulting chunks are too large. This is the default strategy in LangChain and produces good results across document types.

LLM-Assisted Chunking

Also called proposition-based chunking (Chen et al., 2023), this approach uses an LLM to decide how to chunk the document. The model reads the document and identifies the optimal split points based on topic and context. Most expensive but produces the highest quality chunks for complex documents.

Practical guidelines for chunking:

  • Chunk size: 200-800 tokens is the typical range. Smaller chunks improve precision (each chunk is more focused) but may lose context. Larger chunks preserve context but may include irrelevant information.
  • Overlap: 10-20% overlap between consecutive chunks prevents losing information at boundaries.
  • Metadata: Always attach metadata to chunks (source document, section title, page number, date). This enables filtering and attribution.
  • Test empirically: The optimal strategy depends on your specific documents and queries. Test different approaches with representative questions and measure retrieval quality.
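To make the recursive strategy concrete, here is a minimal pure-Python sketch. It measures chunk size in characters rather than tokens for simplicity, and the separator hierarchy (paragraphs, then lines, then sentences, then words) is an illustrative choice, not a fixed standard:

```python
def recursive_chunk(text, max_size=800, separators=("\n\n", "\n", ". ", " ")):
    """Split at the largest boundary first, falling back to smaller
    boundaries only when a piece is still too large."""
    if len(text) <= max_size:
        return [text] if text.strip() else []
    if not separators:
        # No boundaries left: hard-split at max_size
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(piece) > max_size:
                # Piece alone is too big: recurse with finer separators
                chunks.extend(recursive_chunk(piece, max_size, rest))
                current = ""
            else:
                current = piece
    if current.strip():
        chunks.append(current)
    return chunks
```

Every emitted chunk respects `max_size`, and splits land on the coarsest boundary that fits, which is the core idea behind LangChain's recursive splitter.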

Embedding Models and Retrieval Methods

The embedding model converts text into numerical vectors that capture semantic meaning. Two texts about the same topic will have similar vectors, even if they use different words. Choosing the right embedding model and retrieval strategy is critical for RAG quality.

Embedding Models

Common choices include:

  • OpenAI text-embedding-3-large — High quality, widely used, API-based.
  • Cohere embed-english-v3.0 / embed-v4.0 — Strong multilingual support, document-optimized.
  • BGE-M3 / GTE-large — Open-source models that rival commercial options. Can be self-hosted.
  • Voyage AI — Domain-specific models for code, legal, finance.

Retrieval Methods

Dense retrieval (vector search) — The standard approach. Embed the query, find the nearest vectors. Great for semantic similarity but can miss exact keyword matches.

Sparse retrieval (BM25/TF-IDF) — Traditional keyword-based search. Fast and excellent for exact matches, acronyms, and proper nouns that embedding models may miss.

Hybrid retrieval — Combine dense and sparse retrieval, then merge the results. This is the current best practice because it captures both semantic meaning and keyword relevance. Most production RAG systems use hybrid retrieval.

Re-Ranking

After initial retrieval, a re-ranker model re-scores the candidates using a more powerful cross-encoder model. The re-ranker sees the query and each document together (unlike bi-encoders used for initial retrieval), so it can make more nuanced relevance judgments. Cohere Rerank and cross-encoder models from Hugging Face are popular choices. Re-ranking typically improves retrieval quality by 10-30%.

# Hybrid retrieval with re-ranking
def hybrid_search(query, vector_store, bm25_index, reranker):
    # Dense retrieval
    dense_results = vector_store.similarity_search(query, top_k=20)

    # Sparse retrieval
    sparse_results = bm25_index.search(query, top_k=20)

    # Merge and deduplicate
    candidates = merge_results(dense_results, sparse_results)

    # Re-rank with cross-encoder
    ranked = reranker.rank(query, candidates, top_k=5)

    return ranked
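The `merge_results` step above is left abstract; one common way to implement it is Reciprocal Rank Fusion (RRF), which combines ranked lists using only rank positions, so dense and sparse scores never need to be put on a common scale. A minimal sketch, assuming each result list is an ordered list of document IDs:

```python
def rrf_merge(dense_results, sparse_results, k=60):
    """Merge two ranked lists of document IDs with Reciprocal Rank Fusion.
    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant suggested in the original RRF paper."""
    scores = {}
    for results in (dense_results, sparse_results):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents appearing in both lists get a natural boost, which is exactly the behavior hybrid retrieval wants before re-ranking.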

Advanced Retrieval Techniques

Beyond basic dense and sparse retrieval, several advanced techniques can significantly improve RAG quality by addressing the fundamental mismatch between how users phrase queries and how relevant information is stored.

HyDE (Hypothetical Document Embeddings)

Introduced by Gao et al. (2022), HyDE addresses the query-document asymmetry problem. Instead of embedding the raw user query (which is typically short and may use different vocabulary than the target documents), HyDE prompts the LLM to generate a hypothetical answer to the question. This hypothetical document is then embedded and used for retrieval. Because the generated text is closer in style and vocabulary to actual documents in the corpus, it often retrieves more relevant results than the raw query would. HyDE is particularly effective for complex or abstract questions where the user's phrasing differs significantly from the language used in the knowledge base.
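The HyDE flow can be sketched in a few lines. This is a backend-agnostic illustration: `llm_generate`, `embed`, and `vector_store` are injected callables standing in for whatever LLM, embedding model, and vector database you use.

```python
def hyde_search(question, llm_generate, embed, vector_store, top_k=5):
    """HyDE: retrieve with the embedding of a hypothetical answer
    rather than the raw query."""
    # 1. Ask the LLM to write a plausible (possibly wrong) answer
    hypothetical = llm_generate(
        f"Write a short passage that answers the question:\n{question}"
    )
    # 2. Embed the hypothetical document, not the query
    doc_embedding = embed(hypothetical)
    # 3. Retrieve using the hypothetical document's embedding
    return vector_store.similarity_search(doc_embedding, top_k=top_k)
```

Note that the hypothetical answer's factual accuracy does not matter much; what matters is that its style and vocabulary resemble the documents being searched.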

Parent Document Retrieval

Parent document retrieval decouples the unit of retrieval from the unit of context passed to the LLM. Small chunks (e.g., 200 tokens) are indexed for precise retrieval, but when a small chunk matches, the system returns its parent document or a larger surrounding window (e.g., 2000 tokens) to the LLM. This gives the model richer context for generation while maintaining retrieval precision. This technique is especially useful for documents where meaning depends on surrounding context, such as legal contracts, technical manuals, or narrative text.
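A sketch of the retrieval side, assuming the child-chunk index returns hits carrying a `parent_id` and `parents` is a mapping from parent ID to the larger passage (both names are illustrative):

```python
def parent_document_search(query_embedding, child_store, parents, top_k=3):
    """Retrieve over small child chunks, but return the larger parent
    passages to the LLM for generation."""
    hits = child_store.similarity_search(query_embedding, top_k=top_k)
    seen, context = set(), []
    for chunk in hits:
        if chunk.parent_id not in seen:  # deduplicate shared parents
            seen.add(chunk.parent_id)
            context.append(parents[chunk.parent_id])
    return context
```

Deduplication matters here: several matching child chunks often share one parent, and the LLM should see that parent only once.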

Contextual Retrieval

Introduced by Anthropic (2024), Contextual Retrieval addresses the problem that individual chunks often lose important context when separated from their source document. Before embedding, each chunk is preprocessed by an LLM that generates a concise context summary — explaining where the chunk fits within the broader document and what key entities or topics it relates to. This context is prepended to the chunk before embedding. The result is that embeddings capture not just the chunk's content but its role and meaning within the larger document, leading to significantly improved retrieval accuracy.
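The preprocessing step can be sketched as follows; `llm_generate` is an injected text-completion callable, and the prompt wording is an illustrative simplification of the approach, not Anthropic's exact prompt:

```python
def contextualize_chunks(document, chunks, llm_generate):
    """Prepend an LLM-written situating context to each chunk before
    embedding, in the spirit of Contextual Retrieval."""
    enriched = []
    for chunk in chunks:
        prompt = (
            "Here is a document:\n" + document +
            "\n\nHere is a chunk from it:\n" + chunk +
            "\n\nWrite a short sentence situating this chunk "
            "within the overall document."
        )
        context = llm_generate(prompt)
        # Index and embed the context + chunk, not the bare chunk
        enriched.append(context + "\n" + chunk)
    return enriched
```

Since the whole document is included in every prompt, prompt caching (where available) keeps the per-chunk cost of this preprocessing manageable.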

Agentic RAG: Agent-Driven Retrieval

Agentic RAG transforms RAG from a fixed pipeline into a dynamic, agent-driven process. Instead of a simple "retrieve then generate" flow, an agent decides what to retrieve, when to retrieve it, how to retrieve it, and whether the retrieved information is sufficient.

In basic RAG, the pipeline is rigid: embed the query, retrieve top-k chunks, generate. The user's question goes in, a response comes out, and there is no adaptation or iteration. Agentic RAG puts an intelligent agent in control of this process.

Key capabilities of agentic RAG:

Query planning — The agent analyzes the user's question and decides whether it needs one search or several. "What were our Q3 revenue drivers compared to competitors?" might require three separate searches: Q3 revenue data, internal analysis reports, and competitor financial reports.

Query reformulation — If the initial retrieval yields poor results, the agent reformulates the query and tries again. It might rephrase the question, use different keywords, or search a different data source entirely.

Source routing — The agent selects which knowledge base to search based on the question type. Technical questions go to the engineering docs, policy questions go to the HR knowledge base, and financial questions go to the analytics database.

Iterative refinement — The agent reads retrieved documents, identifies gaps in its knowledge, and performs additional retrievals to fill those gaps. This continues until it has enough information to answer comprehensively.

# Agentic RAG loop
def agentic_rag(question, agent, knowledge_bases):
    context = []

    # Agent plans retrieval strategy
    plan = agent.plan_retrieval(question)

    for step in plan.steps:
        # Agent picks the right source and query
        source = step.knowledge_base
        query = step.reformulated_query

        results = knowledge_bases[source].search(query)
        context.extend(results)

        # Agent evaluates whether it has enough information
        assessment = agent.evaluate_sufficiency(question, context)
        if assessment.sufficient:
            break
        if assessment.needs_more_info:
            plan.add_steps(assessment.additional_queries)

    return agent.generate_answer(question, context)

Corrective RAG and Self-RAG

Two important advances push RAG reliability even further by adding self-correction mechanisms:

Corrective RAG (CRAG)

Corrective RAG adds a retrieval evaluator that assesses the quality of retrieved documents before passing them to the generator. If the evaluator determines the retrieved documents are not relevant enough, the system takes corrective action.

The CRAG workflow:

  1. Retrieve documents as usual.
  2. A relevance evaluator scores each document as Correct, Ambiguous, or Incorrect.
  3. If documents are Correct — proceed to generation normally.
  4. If documents are Ambiguous — refine the query, filter irrelevant passages from retrieved documents, and supplement with web search if needed.
  5. If documents are Incorrect — fall back to web search or inform the user that the knowledge base does not contain relevant information.
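The decision logic above can be sketched as a simple dispatch. All four callables are injected stand-ins for real components, and `evaluate` is assumed to return one of the three verdicts as a string:

```python
def corrective_rag(question, retrieve, evaluate, web_search, generate):
    """Sketch of the CRAG workflow: retrieve, grade the results,
    then correct course before generating."""
    docs = retrieve(question)
    verdict = evaluate(question, docs)
    if verdict == "correct":
        context = docs
    elif verdict == "ambiguous":
        # Keep the retrieved passages, supplement with web results
        context = docs + web_search(question)
    else:  # "incorrect": the knowledge base has nothing relevant
        context = web_search(question)
    return generate(question, context)
```

A real implementation would also filter irrelevant passages in the ambiguous branch; the sketch keeps only the routing decision.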

Self-RAG

Self-RAG goes further by having the model evaluate its own output. The model generates special critique tokens (called "reflection tokens" in the original paper by Asai et al., 2023) that assess whether retrieval is needed, whether the retrieved content is relevant, and whether the generated answer is supported by the evidence.

Self-RAG makes four decisions during generation:

  • Retrieve? — Does this question need external knowledge, or can I answer from my training data?
  • Relevant? — Are the retrieved documents actually relevant to the question?
  • Supported? — Is my generated answer fully supported by the retrieved evidence?
  • Useful? — Is my answer actually useful and complete for the user?

If any check fails, the system loops back: re-retrieves, re-generates, or asks for clarification. This self-correcting behavior dramatically reduces hallucination and improves answer quality, making Self-RAG one of the most reliable RAG architectures available.
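The four-check loop can be sketched as follows. This is an orchestration-level approximation: in real Self-RAG the judgments are reflection tokens emitted during decoding, whereas here they are modeled as methods on an assumed `agent` object (`needs_retrieval`, `is_relevant`, `is_supported`, `is_useful` are all illustrative names):

```python
def self_rag_answer(question, agent, max_attempts=3):
    """Sketch of the Self-RAG reflection loop with a bounded retry budget."""
    # Retrieve? -- only fetch external knowledge when needed
    docs = agent.retrieve(question) if agent.needs_retrieval(question) else []
    for _ in range(max_attempts):
        answer = agent.generate(question, docs)
        if docs and not agent.is_relevant(question, docs):
            docs = agent.retrieve(question)    # Relevant? -> re-retrieve
            continue
        if docs and not agent.is_supported(answer, docs):
            continue                           # Supported? -> re-generate
        if agent.is_useful(question, answer):  # Useful? -> done
            return answer
    return answer  # best effort after exhausting the retry budget
```

Bounding the loop with `max_attempts` is a practical necessity in any self-correcting pipeline; without it, a persistently failing check would retry forever.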

Both Corrective RAG and Self-RAG are natural fits for agentic architectures — the agent's reasoning capabilities power the evaluation and correction decisions, turning RAG from a static pipeline into an adaptive, self-improving system.

Key Takeaways

  1. RAG enhances LLM responses by retrieving relevant information from external knowledge bases and injecting it into the model's context.
  2. Chunking strategy significantly impacts retrieval quality; use semantic or recursive chunking with 200-800 token chunks and 10-20% overlap.
  3. Hybrid retrieval (combining dense vector search and sparse keyword search) with re-ranking is the current best practice for production RAG.
  4. Agentic RAG puts an intelligent agent in control of retrieval, enabling query planning, reformulation, source routing, and iterative refinement.
  5. Corrective RAG evaluates retrieved documents and takes corrective action when they are not relevant enough.
  6. Self-RAG has the model evaluate its own output with reflection tokens, creating a self-correcting generation process.
