RAG Chatbot Evaluation: How a Model Sweep + Retrieval Fixes Cut Costs 79% and Raised Quality 19%

A Reddit user walked through a full evaluation of a customer support RAG chatbot that was running on ChromaDB with a default similarity threshold of 0.7 (cosine distance) and using Gemini 3.1 Flash Lite Preview for generation. They found that the most expensive model was the worst performer and that several non-obvious changes actually moved the needle.
Retrieval Problems Masquerade as LLM Problems
The bot responded “I don't have access to specific information about our company's services” when users asked casual openers like “hey what do you guys do?”. The instinct was to tweak prompts or swap models, but the root cause was retrieval: the similarity threshold in ChromaDB was set to 0.7 (cosine distance, lower = more similar, so actually strict). Casual openers didn’t produce embeddings close enough to any chunk, so zero documents were retrieved. The lesson: log what context the LLM actually received before blaming generation. If retrieval returns nothing, no amount of prompt engineering fixes it.
Heuristic Evaluators Are Worse Than None
Keyword matching and source reference counting gave numbers with no correlation to user satisfaction. The author switched to an LLM judge (Claude Haiku 4.5 via OpenRouter) scoring relevance, accuracy, helpfulness, and overall on 0-10. Cost: a few cents per full run.
Deduplicate Chunks
Two turns had three near-identical FAQ chunks in the context window. Adding a check for >80% token overlap from the same source file cleaned up context, reduced tokens, and stopped a hallucination of product names on one turn.
Stricter Grounding Trade-Off
Adding a rule that the agent only states facts from retrieved docs boosted accuracy but reduced helpfulness on knowledge-gap turns: the bot started saying “the docs don't specify this, contact support” instead of guessing. The author notes this is the right call for a factual support bot but must be made consciously.
Model Sweep Results
Running the same eval harness across 5 models showed that Gemma 4 26B scored 7.88 vs. the original Gemini 3.1 Flash Lite Preview’s 7.33 — and cost 75% less per session. Mistral Small 3.2 was close second. Nova Micro was cheapest but terse responses got penalized for not being actionable. Overall quality improved from 6.62 to 7.88 (+19%) and cost dropped from $0.002420 to $0.000509 per session (−79%).
The entire evaluation was done using Neo AI Engineer, which built the eval harness, handled checkpointed runs, dealt with timeout and context limit issues, and consolidated results. The author reviewed everything manually.
📖 Read the full source: r/LocalLLaMA
👀 See Also

OpenClaw 4.1 with Gemma 4 Stack: Hybrid Architecture and Setup Fixes
A Reddit post details an optimized local agent stack combining OpenClaw 4.1 with Google's Gemma 4 model, featuring a hybrid architecture, specific configuration fixes for Ollama tool calling, and context window adjustments.

Fixing Claude Code's KV Cache Invalidation with Local Backends
Claude Code versions 2.1.36+ inject dynamic telemetry headers and git status updates into every request, breaking prefix matching and forcing full 20K+ token system prompt reprocessing on local backends like llama.cpp. A configuration fix in ~/.claude/settings.json can reduce processing from 60+ seconds to ~4 seconds.

Fixing Claude Cowork 'Failed to start workspace' errors on Windows 11 Home
A user solved Claude Cowork startup errors on Windows 11 Home by installing Windows Subsystem for Linux (WSL2) from the Microsoft Store, which is required for the underlying VM technology.

Custom Command Center App for OpenClaw: React PWA with WebSocket Proxy and Tailscale
A developer built a React PWA command center for their OpenClaw setup, featuring a live agent dashboard, trading desk, and push notifications, using a WebSocket proxy pattern to bridge OpenClaw's loopback-only gateway with devices on a Tailscale mesh.