RAG Architecture Patterns on AWS Bedrock: Naive, Advanced, and Agentic


Key takeaway: RAG (Retrieval-Augmented Generation) on AWS Bedrock comes in three distinct architectural tiers — naive, advanced, and agentic — each with different complexity, cost, and performance characteristics. Choosing the right tier for your use case is the most consequential early architectural decision, and most teams default to the wrong one.

Retrieval-Augmented Generation is now the default pattern for enterprise AI applications that need to operate on proprietary data without the risk, cost, and latency of full model fine-tuning. The core idea is straightforward: instead of baking organizational knowledge into model weights, you retrieve relevant context at query time and inject it into the prompt. The model reasons over your data without ever having been trained on it.

In practice, "RAG on AWS Bedrock" describes a spectrum of architectures with meaningfully different characteristics. This guide covers the three main tiers EFS AI works with in production — what distinguishes them, when each is appropriate, and the specific AWS services and configuration decisions that matter most for each.

The Three RAG Tiers

| Tier | Retrieval Mechanism | Query Handling | Complexity | Best For |
| --- | --- | --- | --- | --- |
| Naive RAG | Single vector similarity search | One-shot retrieval, then generate | Low | Single-topic knowledge bases, FAQ, document Q&A |
| Advanced RAG | Hybrid search, re-ranking, query rewriting | Multi-step retrieval with scoring | Medium | Multi-domain corpora, conversational assistants, document analysis |
| Agentic RAG | Dynamic tool selection across multiple retrieval sources | Multi-hop reasoning, iterative retrieval | High | Autonomous workflows, multi-system integrations, complex Q&A |

Tier 1: Naive RAG

Naive RAG is the simplest working implementation: embed your documents, store the vectors, embed the query, retrieve the top-k most similar chunks, stuff them into the prompt, and generate. It works well enough for narrow, well-scoped knowledge bases where the user's question closely resembles the content being retrieved.

On AWS Bedrock, a typical naive RAG stack combines four pieces: an embedding model such as Amazon Titan Embeddings, OpenSearch Serverless as the vector store, Bedrock Knowledge Bases to manage ingestion and retrieval, and a Claude model for generation.

Chunking strategy for naive RAG: Fixed-size chunks (512-1024 tokens) with 10-20% overlap work adequately when documents are relatively uniform in structure. For mixed-format corpora (PDFs, HTML, structured tables), fixed-size chunking degrades retrieval precision — this is usually the first sign that you need to move to advanced RAG.
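
The overlap behavior described above can be sketched in a few lines. This is a minimal illustration operating on a pre-tokenized sequence; the tokenizer itself (typically the embedding model's) is outside the sketch, and the function name is ours, not an AWS API.

```python
def chunk_fixed(tokens, chunk_size=512, overlap=64):
    """Split a token sequence into fixed-size chunks with overlap.

    overlap=64 on a 512-token chunk is ~12%, inside the 10-20%
    range suggested in the text.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk absorbed the tail of the document
    return chunks
```

The overlap ensures a sentence falling near a chunk boundary appears intact in at least one chunk, which is what makes fixed-size splitting tolerable for uniform documents.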

Where naive RAG breaks down: Precision drops when users ask questions that require synthesizing information across multiple document sections, when the query vocabulary differs significantly from document vocabulary (lexical gap), or when the corpus grows beyond a few thousand documents and noise in the top-k results increases.

Tier 2: Advanced RAG

Advanced RAG adds a set of precision techniques around the basic retrieve-and-generate loop. In our production implementations, the most impactful techniques — roughly in order of improvement per unit of added complexity — are:

Hybrid Search

Combining dense vector retrieval with sparse keyword retrieval (BM25) improves recall, especially for exact-match queries like product codes, regulatory citations, or proper nouns. OpenSearch Serverless supports both retrieval modes natively. A reciprocal rank fusion (RRF) pass merges the results into a unified ranked list. For a healthcare AI application we deployed with HIPAA-aligned architecture, hybrid search was responsible for roughly a third of the precision improvement over naive RAG — particularly for retrieval of specific ICD codes and procedure references.
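The RRF merge step mentioned above is simple enough to show directly. This is a generic client-side sketch, not OpenSearch-specific code: each input list is a ranked result list (e.g. one from the vector query, one from BM25), and k=60 is the constant conventionally used with RRF.

```python
def rrf_merge(ranked_lists, k=60):
    """Merge several ranked result lists with reciprocal rank fusion.

    Each document's fused score is the sum of 1/(k + rank) over every
    list it appears in, so items ranked well by both dense and sparse
    retrieval rise to the top of the unified list.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in only one list can still win if it ranks very highly there, which is why RRF handles exact-match queries (where only the BM25 list finds the document) gracefully.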

Query Rewriting

User queries in conversational interfaces are often incomplete, ambiguous, or reference context from earlier in the conversation. A lightweight rewriting step — a fast Bedrock call using Haiku to expand and disambiguate the query — significantly improves retrieval relevance before the main retrieval pass. Cost impact is minimal; latency impact is 100-200ms on average.
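A sketch of the rewriting step, under stated assumptions: the prompt wording is illustrative, and the model call is injected as a callable so the function works with any invocation path (e.g. a Haiku call via the Bedrock converse API) and stays testable without AWS credentials.

```python
REWRITE_PROMPT = (
    "Rewrite the user's latest question as a standalone search query. "
    "Resolve pronouns and references using the conversation history. "
    "Return only the rewritten query.\n\n"
    "History:\n{history}\n\nLatest question: {question}"
)

def rewrite_query(history, question, invoke):
    """Expand a conversational query into a standalone retrieval query.

    history: list of (role, text) turns from earlier in the session.
    invoke:  callable that sends a prompt to a fast model and returns
             the text response (hypothetical injection point).
    """
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = REWRITE_PROMPT.format(history=transcript, question=question)
    return invoke(prompt).strip()
```

The rewritten string, not the raw user turn, is what gets embedded and sent to the retriever.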

Re-ranking

Retrieve more candidates (top-20 to top-50) and apply a cross-encoder re-ranker to reorder them before passing the final top-k to the generation model. Cohere Rerank is available directly in Bedrock. Re-ranking consistently improves answer quality on complex queries at the cost of added latency (~300-500ms). For low-latency applications, skip it; for knowledge-intensive Q&A, it's often worth the tradeoff.
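The over-fetch-then-rerank pattern looks like this in outline. The scoring function is injected; in practice it would wrap the re-ranker call (e.g. Cohere Rerank via Bedrock), which is assumed here rather than shown.

```python
def rerank(query, candidates, score_fn, final_k=5):
    """Reorder over-fetched retrieval candidates before generation.

    candidates: the top-20..50 passages from the vector store.
    score_fn:   callable(query, passage) -> relevance score; stands in
                for a cross-encoder re-ranker call.
    Returns the best final_k passages to place in the prompt.
    """
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:final_k]]
```

The latency cost comes from score_fn being a model call per candidate (or per batch), which is why the candidate count is the main tuning knob.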

Semantic Chunking

Rather than splitting on token count, semantic chunking splits on natural content boundaries — section headings, paragraph breaks, topic shifts. This produces chunks that are more coherent as standalone retrieval units. For a GenAI HR assistant we built for an enterprise client, switching from fixed-size to semantic chunking reduced hallucinated citations from 4.2% to 0.8% of responses during evaluation — because retrieved chunks were less likely to straddle two unrelated topics.
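A minimal sketch of boundary-based splitting. The heading heuristic here (Markdown `#` lines or ALL-CAPS lines) is a deliberately crude assumption; production pipelines typically use format-aware parsers per document type (PDF structure, HTML headings, and so on).

```python
import re

def semantic_chunks(text, max_chars=2000):
    """Split text on section headings and paragraph breaks so each
    chunk stays on one topic, instead of cutting at a token count.
    """
    heading = re.compile(r"^(#+\s|[A-Z][A-Z0-9 ]{3,}$)")
    chunks, current = [], []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        starts_section = bool(heading.match(para.splitlines()[0]))
        too_big = sum(len(p) for p in current) + len(para) > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n\n".join(current))  # close the section
            current = []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because a chunk never straddles a section boundary, a retrieved chunk is far less likely to mix two unrelated topics, which is the mechanism behind the citation-accuracy improvement described above.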

Vector Store Selection: OpenSearch vs. pgvector

| | OpenSearch Serverless | PostgreSQL + pgvector (RDS/Aurora) |
| --- | --- | --- |
| Best for | Large corpora (100K+ vectors), high query throughput, hybrid search | Smaller corpora, existing RDS infrastructure, transactional metadata joins |
| Hybrid search | Native (BM25 + vector) | Requires custom implementation (FTS and vector as separate queries) |
| Cost at low scale | Higher (minimum OCU pricing) | Lower (shares RDS instance) |
| Bedrock integration | First-class Knowledge Bases support | Supported via Bedrock Knowledge Bases, but less turnkey |
| Metadata filtering | Strong (structured filters on index fields) | Strong (full SQL predicates on metadata columns) |

For most greenfield enterprise AI applications, OpenSearch Serverless is the right default. If you have an existing RDS Aurora PostgreSQL instance and your corpus is under 50,000 vectors, pgvector is worth considering to reduce infrastructure sprawl.

Tier 3: Agentic RAG

Agentic RAG replaces the static retrieve-and-generate loop with a planning-and-execution loop. The agent has access to multiple retrieval tools — different vector indexes, structured databases, APIs, document stores — and decides at runtime which tools to call and in what sequence to answer a question. Multi-hop queries that require facts from two different systems, or questions where the agent needs to validate a retrieved answer against a second source, become tractable.

AWS Bedrock Agents provides the orchestration layer. Each tool is exposed as an action group backed by a Lambda function. The agent receives the query, reasons about which tools to use (using the underlying foundation model's planning capability), executes the tool calls, interprets the results, and either generates a final answer or calls additional tools if its confidence in the current answer is insufficient (see Confidence Gating for Enterprise AI).
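The Lambda behind an action group is mostly a dispatcher. The sketch below follows the function-details event/response shape used by Bedrock Agents action groups; verify the exact field names against current AWS documentation before deploying. The two tool names and their backing lookups are hypothetical placeholders.

```python
def search_vector_index(query):
    # Placeholder: call OpenSearch or a Knowledge Base retrieve API here.
    return f"top passages for: {query}"

def query_hr_database(employee_id):
    # Placeholder: query a structured store (e.g. Aurora) here.
    return f"record for employee {employee_id}"

def lambda_handler(event, context):
    """Action-group handler: dispatch the agent's chosen function call
    to the matching retrieval backend and return its result."""
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    function = event["function"]

    if function == "search_policy_docs":        # hypothetical tool name
        body = search_vector_index(params["query"])
    elif function == "lookup_employee_record":  # hypothetical tool name
        body = query_hr_database(params["employee_id"])
    else:
        body = f"Unknown function: {function}"

    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": function,
            "functionResponse": {
                "responseBody": {"TEXT": {"body": str(body)}},
            },
        },
    }
```

The agent decides which function to invoke and with what parameters; the Lambda only executes and formats — keeping the tools thin makes the agent's reasoning traces easier to audit.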

When to reach for agentic RAG:

- The answer requires facts from two or more separate systems (multi-hop queries).
- Retrieved answers need validation against a second source before being returned.
- The workflow is autonomous and must decide at runtime which data sources to consult.

Agentic RAG carries additional complexity costs that need to be weighed deliberately: latency is higher (multi-step tool calls), debugging is harder (reasoning steps are not fully transparent without explicit tracing), and cost is higher (multiple Bedrock invocations per query). For straightforward knowledge base Q&A, advanced RAG with good re-ranking usually outperforms agentic RAG on cost and latency while delivering comparable quality.

Retrieval Evaluation: How to Know If Your RAG Is Working

The single most common mistake in RAG implementations is shipping without a retrieval evaluation harness. The metrics that matter most in production:

- Context precision — how much of the retrieved context is actually relevant to the question.
- Context recall — whether retrieval surfaces all the information needed to answer.
- Faithfulness — whether the generated answer is grounded in the retrieved context rather than hallucinated.
- Answer relevancy — whether the answer actually addresses the question asked.

Tools like RAGAS provide automated evaluation against all four metrics. In a HIPAA AI implementation, we run the evaluation harness continuously against a fixed set of golden questions as part of the CI/CD pipeline — a regression in answer faithfulness triggers a review before deployment.
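A golden-set check does not require a framework to get started. The sketch below computes retrieval recall@k over a fixed set of questions with known-relevant chunk ids — the kind of assertion that can gate a CI/CD pipeline; the data shapes are our assumptions, not a RAGAS API.

```python
def recall_at_k(golden, retrieve, k=5):
    """Fraction of golden questions whose known-relevant chunk ids
    appear in the top-k retrieved results.

    golden:   dict mapping question -> set of relevant chunk ids.
    retrieve: callable(question) -> ranked list of chunk ids, i.e.
              your retrieval pipeline.
    """
    hits = 0
    for question, relevant in golden.items():
        top_k = set(retrieve(question)[:k])
        if top_k & relevant:
            hits += 1
    return hits / len(golden)
```

Run against the same golden set on every deploy, a drop in this number localizes a regression to the retrieval stage before anyone debugs the generation prompt.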

Building for Production on AWS Bedrock

EFS AI has deployed RAG architectures on Bedrock across healthcare, HR, and operations use cases. A few architectural decisions that consistently matter in production:

- Choose the vector store deliberately (OpenSearch Serverless vs. pgvector) based on corpus size, hybrid search needs, and existing infrastructure.
- Match the chunking strategy to the corpus — fixed-size for uniform documents, semantic for mixed-format content.
- Run the retrieval evaluation harness continuously against a golden question set in CI/CD, not just at launch.
- Gate low-confidence answers behind review rather than returning them autonomously.


Disclaimer: AI performance varies by data quality, corpus characteristics, and use case. Metrics cited are from production deployment validation. Actual results will vary. EFS designs infrastructure and implements controls aligned with HIPAA and related frameworks. Ultimate compliance responsibility rests with the client organization. AWS and other third-party platforms referenced have their own compliance certifications and shared responsibility models.



Let's talk about what you're building.

Our team brings over two decades of experience to every engagement. Tell us about your project and we'll show you what's possible.

Related

How Confidence Gating Makes AI Safe for Enterprise Decisions


How confidence gating prevents autonomous AI from making bad decisions in production — with EDI automation and HIPAA workflow examples from EFS.

Agentic vs. Generative AI: A Decision Framework for Enterprise Leaders


A practical decision framework for choosing between agentic and generative AI — with a decision matrix and real case studies from EFS.

AI Governance in Regulated Industries


How production AI governance works in HIPAA and SOC 2 environments — model controls, data classification, audit trails, and incident response.