RAG Pipeline Test Shows Cost Per Token Isn't the Right Metric for Model Selection

A developer ran a production-level comparison of three AI models using identical RAG pipelines to answer a nuanced customer query about SOC 2 compliance. The test used Claude Haiku 4.5, Amazon Nova Pro, and Amazon Nova Lite with the same setup: two vector stores (product docs and marketing/competitive docs), 13 Architecture Decision Records as grounding context, approximately 49K input tokens of retrieved context per query, identical system prompts, and the same Bedrock API call structure with only the model ID changed.
Test Setup and Results
The query was: "A customer asked about SOC 2 compliance — how do I respond?" All models received the same RAG context containing a complete playbook with copy-paste emails, objection handlers, competitive positioning, framework-specific compliance answers, and guardrails for what not to say.
Results:
- Nova Lite: 49,067 input tokens, 244 output tokens, 5.5s response time, ~$0.003 cost
- Nova Pro: 49,067 input tokens, 368 output tokens, 13.5s response time, ~$0.040 cost
- Haiku 4.5: 53,674 input tokens, 1,534 output tokens, 15.6s response time, $0.049 cost
Output Quality Comparison
Despite identical context, the models produced dramatically different responses:
- Nova Lite: Generated a four-paragraph generic email that got the core fact right (deploys in your account, no separate SOC 2 report) but included no objection handling, competitive positioning, or nuance from the context. Ended with meta-commentary about adhering to ADRs.
- Nova Pro: Produced seven numbered bullet points covering technical aspects like data residency, authentication, access control, monitoring, patching, secrets management, and compliance scope. Technically accurate but read like pasted AWS documentation with similar meta-commentary.
- Haiku 4.5: Delivered a full playbook with plain-English explanation, copy-paste ready email, pushback handler with Terraform analogy, framework-specific answers for HIPAA, PCI-DSS, SOX, FINRA, "what NOT to say" guardrails, CRM-ready talking points, and competitive positioning against other tools.
Key Finding
The gap wasn't about available information—all models had the same ~49K input tokens containing the complete playbook. The difference was in what each model could extract and synthesize. Nova Lite extracted one fact, Nova Pro organized facts into a list, while Haiku synthesized the context into an actionable toolkit with anticipated follow-ups.
The cost difference between Nova Pro and Haiku was $0.009 per query (less than a penny), but the output quality gap was substantial. The cheapest model per token produced responses that would require 2-3 follow-up queries to match Haiku's single-pass output, ultimately costing more through repeated RAG pipeline usage.
📖 Read the full source: r/ClaudeAI
👀 See Also

Accidental Dashboard Built with Claude Created a Product Commitment Nightmare
A developer built a dashboard with Claude in 2 days, forgot to feature-flag it, 40 customers found it and love it. Now customers want customization, requiring a 3-week refactor to make the hardcoded code extensible.

RunLobster AI Agent Integrates Business Data for Operational Insights
A developer gave RunLobster root access to their business systems including Stripe, CRM, email, and call transcripts. The agent autonomously monitors operations, flags anomalies, and provides detailed briefings based on integrated data analysis.

OpenClaw Case Study: Managing an Email Inbox for 10 Days Without Human Intervention
A freelance consultant gave OpenClaw full access to their Gmail for 10 days while traveling, with instructions to reply in their exact tone, flag only critical items, and handle routine tasks autonomously. The system processed 187 emails with only one minor error.

Readigo: iOS App Uses Claude as AI Reading Coach for Kids
A developer built Readigo, an iOS app where children read stories to an AI dragon character. Claude analyzes speech-to-text transcripts to score reading accuracy, fluency, pacing, and clarity, then generates age-appropriate feedback.