RAG Architecture Patterns on AWS Bedrock: Naive, Advanced, and Agentic


Key takeaway: RAG (Retrieval-Augmented Generation) on AWS Bedrock comes in three distinct architectural tiers — naive, advanced, and agentic — each with different complexity, cost, and performance characteristics. Choosing the right tier for your use case is the most consequential early architectural decision, and most teams default to the wrong one.

Retrieval-Augmented Generation is now the default pattern for enterprise AI applications that need to operate on proprietary data without the risk, cost, and latency of full model fine-tuning. The core idea is straightforward: instead of baking organizational knowledge into model weights, you retrieve relevant context at query time and inject it into the prompt. The model reasons over your data without ever having been trained on it.

In practice, "RAG on AWS Bedrock" describes a spectrum of architectures with meaningfully different characteristics. This guide covers the three main tiers EFS AI works with in production — what distinguishes them, when each is appropriate, and the specific AWS services and configuration decisions that matter most for each.

The Three RAG Tiers

| Tier | Retrieval Mechanism | Query Handling | Complexity | Best For |
| --- | --- | --- | --- | --- |
| Naive RAG | Single vector similarity search | One-shot retrieval, then generate | Low | Single-topic knowledge bases, FAQ, document Q&A |
| Advanced RAG | Hybrid search, re-ranking, query rewriting | Multi-step retrieval with scoring | Medium | Multi-domain corpora, conversational assistants, document analysis |
| Agentic RAG | Dynamic tool selection across multiple retrieval sources | Multi-hop reasoning, iterative retrieval | High | Autonomous workflows, multi-system integrations, complex Q&A |

Tier 1: Naive RAG

Naive RAG is the simplest working implementation: embed your documents, store the vectors, embed the query, retrieve the top-k most similar chunks, stuff them into the prompt, and generate. It works well enough for narrow, well-scoped knowledge bases where the user's question closely resembles the content being retrieved.

On AWS Bedrock, a typical naive RAG stack combines four pieces: an embedding model such as Amazon Titan Embeddings, OpenSearch Serverless as the vector store, Bedrock Knowledge Bases to manage ingestion and retrieval, and a Claude model for generation.

Chunking strategy for naive RAG: Fixed-size chunks (512-1024 tokens) with 10-20% overlap work adequately when documents are relatively uniform in structure. For mixed-format corpora (PDFs, HTML, structured tables), fixed-size chunking degrades retrieval precision — this is usually the first sign that you need to move to advanced RAG.
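
The overlap behavior described above can be sketched in a few lines. This is a minimal illustration operating on a pre-tokenized sequence; the tokenizer itself (typically the embedding model's) is outside the sketch, and the function name is ours, not an AWS API.

```python
def chunk_fixed(tokens, chunk_size=512, overlap=64):
    """Split a token sequence into fixed-size chunks with overlap.

    overlap=64 on a 512-token chunk is ~12%, inside the 10-20%
    range suggested in the text.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk absorbed the tail of the document
    return chunks
```

The overlap ensures a sentence falling near a chunk boundary appears intact in at least one chunk, which is what makes fixed-size splitting tolerable for uniform documents.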

Where naive RAG breaks down: Precision drops when users ask questions that require synthesizing information across multiple document sections, when the query vocabulary differs significantly from document vocabulary (lexical gap), or when the corpus grows beyond a few thousand documents and noise in the top-k results increases.

Tier 2: Advanced RAG

Advanced RAG adds a set of precision techniques around the basic retrieve-and-generate loop. In our production implementations, the most impactful techniques — roughly in order of improvement per unit of added complexity — are:

Hybrid Search

Combining dense vector retrieval with sparse keyword retrieval (BM25) improves recall, especially for exact-match queries like product codes, regulatory citations, or proper nouns. OpenSearch Serverless supports both retrieval modes natively. A reciprocal rank fusion (RRF) pass merges the results into a unified ranked list. For a healthcare AI application we deployed with HIPAA-aligned architecture, hybrid search was responsible for roughly a third of the precision improvement over naive RAG — particularly for retrieval of specific ICD codes and procedure references.
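The RRF merge step mentioned above is simple enough to show directly. This is a generic client-side sketch, not OpenSearch-specific code: each input list is a ranked result list (e.g. one from the vector query, one from BM25), and k=60 is the constant conventionally used with RRF.

```python
def rrf_merge(ranked_lists, k=60):
    """Merge several ranked result lists with reciprocal rank fusion.

    Each document's fused score is the sum of 1/(k + rank) over every
    list it appears in, so items ranked well by both dense and sparse
    retrieval rise to the top of the unified list.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in only one list can still win if it ranks very highly there, which is why RRF handles exact-match queries (where only the BM25 list finds the document) gracefully.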

Query Rewriting

User queries in conversational interfaces are often incomplete, ambiguous, or reference context from earlier in the conversation. A lightweight rewriting step — a fast Bedrock call using Haiku to expand and disambiguate the query — significantly improves retrieval relevance before the main retrieval pass. Cost impact is minimal; latency impact is 100-200ms on average.
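A sketch of the rewriting step, under stated assumptions: the prompt wording is illustrative, and the model call is injected as a callable so the function works with any invocation path (e.g. a Haiku call via the Bedrock converse API) and stays testable without AWS credentials.

```python
REWRITE_PROMPT = (
    "Rewrite the user's latest question as a standalone search query. "
    "Resolve pronouns and references using the conversation history. "
    "Return only the rewritten query.\n\n"
    "History:\n{history}\n\nLatest question: {question}"
)

def rewrite_query(history, question, invoke):
    """Expand a conversational query into a standalone retrieval query.

    history: list of (role, text) turns from earlier in the session.
    invoke:  callable that sends a prompt to a fast model and returns
             the text response (hypothetical injection point).
    """
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = REWRITE_PROMPT.format(history=transcript, question=question)
    return invoke(prompt).strip()
```

The rewritten string, not the raw user turn, is what gets embedded and sent to the retriever.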

Re-ranking

Retrieve more candidates (top-20 to top-50) and apply a cross-encoder re-ranker to reorder them before passing the final top-k to the generation model. Cohere Rerank is available directly in Bedrock. Re-ranking consistently improves answer quality on complex queries at the cost of added latency (~300-500ms). For low-latency applications, skip it; for knowledge-intensive Q&A, it's often worth the tradeoff.
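The over-fetch-then-rerank pattern looks like this in outline. The scoring function is injected; in practice it would wrap the re-ranker call (e.g. Cohere Rerank via Bedrock), which is assumed here rather than shown.

```python
def rerank(query, candidates, score_fn, final_k=5):
    """Reorder over-fetched retrieval candidates before generation.

    candidates: the top-20..50 passages from the vector store.
    score_fn:   callable(query, passage) -> relevance score; stands in
                for a cross-encoder re-ranker call.
    Returns the best final_k passages to place in the prompt.
    """
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:final_k]]
```

The latency cost comes from score_fn being a model call per candidate (or per batch), which is why the candidate count is the main tuning knob.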

Semantic Chunking

Rather than splitting on token count, semantic chunking splits on natural content boundaries — section headings, paragraph breaks, topic shifts. This produces chunks that are more coherent as standalone retrieval units. For a GenAI HR assistant we built for an enterprise client, switching from fixed-size to semantic chunking reduced hallucinated citations from 4.2% to 0.8% of responses during evaluation — because retrieved chunks were less likely to straddle two unrelated topics.
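A minimal sketch of boundary-based splitting. The heading heuristic here (Markdown `#` lines or ALL-CAPS lines) is a deliberately crude assumption; production pipelines typically use format-aware parsers per document type (PDF structure, HTML headings, and so on).

```python
import re

def semantic_chunks(text, max_chars=2000):
    """Split text on section headings and paragraph breaks so each
    chunk stays on one topic, instead of cutting at a token count.
    """
    heading = re.compile(r"^(#+\s|[A-Z][A-Z0-9 ]{3,}$)")
    chunks, current = [], []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        starts_section = bool(heading.match(para.splitlines()[0]))
        too_big = sum(len(p) for p in current) + len(para) > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n\n".join(current))  # close the section
            current = []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because a chunk never straddles a section boundary, a retrieved chunk is far less likely to mix two unrelated topics, which is the mechanism behind the citation-accuracy improvement described above.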

Vector Store Selection: OpenSearch vs. pgvector

| | OpenSearch Serverless | PostgreSQL + pgvector (RDS/Aurora) |
| --- | --- | --- |
| Best for | Large corpora (100K+ vectors), high query throughput, hybrid search | Smaller corpora, existing RDS infrastructure, transactional metadata joins |
| Hybrid search | Native (BM25 + vector) | Requires custom implementation (FTS and vector as separate queries) |
| Cost at low scale | Higher (minimum OCU pricing) | Lower (shares RDS instance) |
| Bedrock integration | First-class Knowledge Bases support | Supported via Bedrock Knowledge Bases, but less turnkey |
| Metadata filtering | Strong (structured filters on index fields) | Strong (full SQL predicates on metadata columns) |

For most greenfield enterprise AI applications, OpenSearch Serverless is the right default. If you have an existing RDS Aurora PostgreSQL instance and your corpus is under 50,000 vectors, pgvector is worth considering to reduce infrastructure sprawl.

Tier 3: Agentic RAG

Agentic RAG replaces the static retrieve-and-generate loop with a planning-and-execution loop. The agent has access to multiple retrieval tools — different vector indexes, structured databases, APIs, document stores — and decides at runtime which tools to call and in what sequence to answer a question. Multi-hop queries that require facts from two different systems, or questions where the agent needs to validate a retrieved answer against a second source, become tractable.

AWS Bedrock Agents provides the orchestration layer. Each tool is exposed as an action group backed by a Lambda function. The agent receives the query, reasons about which tools to use (using the underlying foundation model's planning capability), executes the tool calls, interprets the results, and either generates a final answer or calls additional tools if its confidence in the current answer is insufficient (see Confidence Gating for Enterprise AI).
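The Lambda behind an action group is mostly a dispatcher. The sketch below follows the function-details event/response shape used by Bedrock Agents action groups; verify the exact field names against current AWS documentation before deploying. The two tool names and their backing lookups are hypothetical placeholders.

```python
def search_vector_index(query):
    # Placeholder: call OpenSearch or a Knowledge Base retrieve API here.
    return f"top passages for: {query}"

def query_hr_database(employee_id):
    # Placeholder: query a structured store (e.g. Aurora) here.
    return f"record for employee {employee_id}"

def lambda_handler(event, context):
    """Action-group handler: dispatch the agent's chosen function call
    to the matching retrieval backend and return its result."""
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    function = event["function"]

    if function == "search_policy_docs":        # hypothetical tool name
        body = search_vector_index(params["query"])
    elif function == "lookup_employee_record":  # hypothetical tool name
        body = query_hr_database(params["employee_id"])
    else:
        body = f"Unknown function: {function}"

    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": function,
            "functionResponse": {
                "responseBody": {"TEXT": {"body": str(body)}},
            },
        },
    }
```

The agent decides which function to invoke and with what parameters; the Lambda only executes and formats — keeping the tools thin makes the agent's reasoning traces easier to audit.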

When to reach for agentic RAG:

- The answer requires facts from two or more separate systems (multi-hop queries).
- Retrieved answers need validation against a second source before being returned.
- The workflow is autonomous and must decide at runtime which data sources to consult.

Agentic RAG carries additional complexity costs that need to be weighed deliberately: latency is higher (multi-step tool calls), debugging is harder (reasoning steps are not fully transparent without explicit tracing), and cost is higher (multiple Bedrock invocations per query). For straightforward knowledge base Q&A, advanced RAG with good re-ranking usually outperforms agentic RAG on cost and latency while delivering comparable quality.

Retrieval Evaluation: How to Know If Your RAG Is Working

The single most common mistake in RAG implementations is shipping without a retrieval evaluation harness. The metrics that matter most in production:

- Context precision — how much of the retrieved context is actually relevant to the question.
- Context recall — whether retrieval surfaces all the information needed to answer.
- Faithfulness — whether the generated answer is grounded in the retrieved context rather than hallucinated.
- Answer relevancy — whether the answer actually addresses the question asked.

Tools like RAGAS provide automated evaluation against all four metrics. In a HIPAA AI implementation, we run the evaluation harness continuously against a fixed set of golden questions as part of the CI/CD pipeline — a regression in answer faithfulness triggers a review before deployment.
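A golden-set check does not require a framework to get started. The sketch below computes retrieval recall@k over a fixed set of questions with known-relevant chunk ids — the kind of assertion that can gate a CI/CD pipeline; the data shapes are our assumptions, not a RAGAS API.

```python
def recall_at_k(golden, retrieve, k=5):
    """Fraction of golden questions whose known-relevant chunk ids
    appear in the top-k retrieved results.

    golden:   dict mapping question -> set of relevant chunk ids.
    retrieve: callable(question) -> ranked list of chunk ids, i.e.
              your retrieval pipeline.
    """
    hits = 0
    for question, relevant in golden.items():
        top_k = set(retrieve(question)[:k])
        if top_k & relevant:
            hits += 1
    return hits / len(golden)
```

Run against the same golden set on every deploy, a drop in this number localizes a regression to the retrieval stage before anyone debugs the generation prompt.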

Building for Production on AWS Bedrock

EFS AI has deployed RAG architectures on Bedrock across healthcare, HR, and operations use cases. A few architectural decisions that consistently matter in production:

- Choose the vector store deliberately (OpenSearch Serverless vs. pgvector) based on corpus size, hybrid search needs, and existing infrastructure.
- Match the chunking strategy to the corpus — fixed-size for uniform documents, semantic for mixed-format content.
- Run the retrieval evaluation harness continuously against a golden question set in CI/CD, not just at launch.
- Gate low-confidence answers behind review rather than returning them autonomously.


Disclaimer: AI performance varies by data quality, corpus characteristics, and use case. Metrics cited are from production deployment validation. Actual results will vary. EFS designs infrastructure and implements controls aligned with HIPAA and related frameworks. Ultimate compliance responsibility rests with the client organization. AWS and other third-party platforms referenced have their own compliance certifications and shared responsibility models.



Let's talk about what you're building.

Our team brings over two decades of experience to every engagement. Tell us about your project and we'll show you what's possible.

Related

How Confidence Gating Makes AI Safe for Enterprise Decisions


How confidence gating prevents autonomous AI from making bad decisions in production — with EDI automation and HIPAA workflow examples from EFS.

Agentic vs. Generative AI: A Decision Framework for Enterprise Leaders


A practical decision framework for choosing between agentic and generative AI — with a decision matrix and real case studies from EFS.

AI Governance in Regulated Industries


How production AI governance works in HIPAA and SOC 2 environments — model controls, data classification, audit trails, and incident response.