RAG Pipeline Test: Cost Per Token Misleads Model Selection

A developer ran a production-level comparison of three AI models using identical RAG pipelines to answer a nuanced customer query about SOC 2 compliance. The test used Claude Haiku 4.5, Amazon Nova Pro, and Amazon Nova Lite with the same setup: two vector stores (product docs and marketing/competitive docs), 13 Architecture Decision Records as grounding context, approximately 49K input tokens of retrieved context per query, identical system prompts, and the same Bedrock API call structure with only the model ID changed.

Test Setup and Results

The query was: "A customer asked about SOC 2 compliance — how do I respond?" All models received the same RAG context containing a complete playbook with copy-paste emails, objection handlers, competitive positioning, framework-specific compliance answers, and guardrails for what not to say.

Results:

Nova Lite: 49,067 input tokens, 244 output tokens, 5.5s response time, ~$0.003 cost
Nova Pro: 49,067 input tokens, 368 output tokens, 13.5s response time, ~$0.040 cost
Haiku 4.5: 53,674 input tokens, 1,534 output tokens, 15.6s response time, $0.049 cost

Output Quality Comparison

Despite identical context, the models produced dramatically different responses:

Nova Lite: Generated a four-paragraph generic email that got the core fact right (deploys in your account, no separate SOC 2 report) but included no objection handling, competitive positioning, or nuance from the context. Ended with meta-commentary about adhering to ADRs.
Nova Pro: Produced seven numbered bullet points covering technical aspects like data residency, authentication, access control, monitoring, patching, secrets management, and compliance scope. Technically accurate but read like pasted AWS documentation with similar meta-commentary.
Haiku 4.5: Delivered a full playbook with plain-English explanation, copy-paste ready email, pushback handler with Terraform analogy, framework-specific answers for HIPAA, PCI-DSS, SOX, FINRA, "what NOT to say" guardrails, CRM-ready talking points, and competitive positioning against other tools.

Key Finding

The gap wasn't about available information—all models had the same ~49K input tokens containing the complete playbook. The difference was in what each model could extract and synthesize. Nova Lite extracted one fact, Nova Pro organized facts into a list, while Haiku synthesized the context into an actionable toolkit with anticipated follow-ups.

The cost difference between Nova Pro and Haiku was $0.009 per query (less than a penny), but the output quality gap was substantial. The cheapest model per token produced responses that would require 2-3 follow-up queries to match Haiku's single-pass output, ultimately costing more through repeated RAG pipeline usage.

📖 Read the full source: r/ClaudeAI

RAG Pipeline Test Shows Cost Per Token Isn't the Right Metric for Model Selection

Test Setup and Results

Output Quality Comparison

Key Finding

👀 See Also

Personal Finance Dashboard Built with Claude AI: Self-Hosted with Google Sheets Backend

Developer Switches from Specs to Proposals for Parallel Claude Code Sessions

OpenClaw Case Study: Building 4 Products and Launching a Business in 3 Weeks

Claude Opus 4.6 Used to Build Dating App with 700+ Users in One Month