RAG Chatbot Evaluation: How a Model Sweep and Targeted Fixes Cut Costs 79%

A Reddit user walked through a full evaluation of a customer support RAG chatbot running on ChromaDB with a default similarity threshold of 0.7 (cosine distance) and using Gemini 3.1 Flash Lite Preview for generation. They found that the most expensive model was the worst performer, and that several non-obvious changes (fixing retrieval, replacing heuristic evaluators, deduplicating chunks, tightening grounding, and swapping the model) improved quality and cut cost.

Retrieval Problems Masquerade as LLM Problems

The bot responded “I don't have access to specific information about our company's services” when users asked casual openers like “hey what do you guys do?”. The instinct was to tweak prompts or swap models, but the root cause was retrieval. The ChromaDB threshold of 0.7 was a cosine distance, where lower means more similar, so the setting was stricter than it looked. Casual openers didn’t produce embeddings close enough to any chunk, so zero documents were retrieved. The lesson: log what context the LLM actually received before blaming generation. If retrieval returns nothing, no amount of prompt engineering fixes it.
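To make the "log the context first" advice concrete, here is a minimal sketch of a threshold filter that warns when nothing survives. It assumes the result dict has the nested-list shape that ChromaDB's `collection.query` returns (`documents` and `distances`, one inner list per query). The function name, logger name, inclusive comparison, and sample distances are illustrative assumptions, not taken from the post.

```python
import logging

logger = logging.getLogger("rag.retrieval")  # hypothetical logger name


def filter_by_distance(query_text, result, max_distance=0.7):
    """Keep chunks whose cosine distance is at or below max_distance.

    `result` is assumed to look like a ChromaDB query result for one query:
    {"documents": [[...]], "distances": [[...]]}.
    """
    docs = result["documents"][0]
    dists = result["distances"][0]
    kept = [(doc, dist) for doc, dist in zip(docs, dists) if dist <= max_distance]
    if not kept:
        # This is the failure mode described above: the LLM gets no context.
        closest = min(dists) if dists else None
        logger.warning("no context for %r (closest distance: %s)", query_text, closest)
    else:
        logger.info("%d chunk(s) kept for %r", len(kept), query_text)
    return kept


# Illustrative values only: a casual opener that sits far from every chunk.
sample = {"documents": [["Pricing FAQ", "Refund policy"]], "distances": [[0.82, 0.91]]}
context = filter_by_distance("hey what do you guys do?", sample)
```

With these made-up distances the filter returns an empty list and logs a warning, which is the signal to look at retrieval before touching the prompt.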

Heuristic Evaluators Are Worse Than None

Keyword matching and source reference counting produced numbers with no correlation to user satisfaction. The author switched to an LLM judge (Claude Haiku 4.5 via OpenRouter) that scores relevance, accuracy, helpfulness, and overall quality on a 0-10 scale. Cost: a few cents per full run.
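A rough sketch of the judge plumbing might look like the following. The post does not show its prompt or response format, so the prompt wording, the JSON reply format, and both function names are assumptions. The actual call to the judge model is left out.

```python
import json

# The four dimensions named in the post.
DIMENSIONS = ("relevance", "accuracy", "helpfulness", "overall")


def build_judge_prompt(question, context, answer):
    """Assemble a grading prompt (hypothetical wording)."""
    return (
        "You are grading a customer support chatbot reply.\n"
        f"Question: {question}\n"
        f"Retrieved context: {context}\n"
        f"Reply: {answer}\n"
        "Score each of " + ", ".join(DIMENSIONS) + " from 0 to 10.\n"
        "Respond with a JSON object using exactly those keys."
    )


def parse_judge_scores(raw):
    """Parse the judge's JSON reply and check every score is in 0-10."""
    data = json.loads(raw)
    scores = {}
    for dim in DIMENSIONS:
        value = float(data[dim])  # raises KeyError if a dimension is missing
        if not 0 <= value <= 10:
            raise ValueError(f"{dim} score out of range: {value}")
        scores[dim] = value
    return scores
```

Validating the reply strictly matters here: a judge that silently returns a missing or out-of-range score would reintroduce the kind of meaningless numbers the heuristics produced.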

Deduplicate Chunks

Two turns had three near-identical FAQ chunks in the context window. Adding a check that drops a chunk when it shares more than 80% of its tokens with another chunk from the same source file cleaned up the context, reduced token usage, and stopped a hallucination of product names on one turn.
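One way such a check could work is sketched below. The post gives only the 80% figure and the same-source condition. The whitespace tokenization, the overlap formula (shared tokens divided by the smaller token set), and the `text`/`source` field names are assumptions.

```python
def token_overlap(a, b):
    """Fraction of the smaller chunk's unique tokens that also appear in the other."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b))


def dedupe_chunks(chunks, threshold=0.8):
    """Drop a chunk if it overlaps an already-kept chunk from the same source.

    Each chunk is assumed to be a dict with "text" and "source" keys.
    Retrieval order is preserved, so the highest-ranked copy is the one kept.
    """
    kept = []
    for chunk in chunks:
        is_duplicate = any(
            chunk["source"] == other["source"]
            and token_overlap(chunk["text"], other["text"]) > threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(chunk)
    return kept
```

Restricting the comparison to the same source file keeps the check conservative: similar wording in two different documents is left alone.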

Stricter Grounding Trade-Off

Adding a rule that the agent only states facts from retrieved docs boosted accuracy but reduced helpfulness on knowledge-gap turns: the bot started saying “the docs don't specify this, contact support” instead of guessing. The author notes this is the right call for a factual support bot but must be made consciously.
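The grounding rule itself is not quoted in the post. As a sketch, it could be a fixed instruction placed in front of the retrieved documents in the system prompt. The rule wording, the `[doc N]` labels, and the function name below are assumptions.

```python
# Hypothetical wording; the fallback phrasing mirrors the behaviour described above.
GROUNDING_RULE = (
    "Only state facts that appear in the retrieved documents below. "
    "If they do not answer the question, say that the docs don't specify this "
    "and suggest contacting support. Do not guess."
)


def build_system_prompt(retrieved_chunks):
    """Put the grounding rule first, then the numbered retrieved documents."""
    context = "\n\n".join(
        f"[doc {i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    if not context:
        # Make an empty retrieval explicit instead of sending a blank section.
        context = "(no documents retrieved)"
    return f"{GROUNDING_RULE}\n\nRetrieved documents:\n{context}"
```

Keeping the rule in one constant makes the trade-off easy to toggle in an eval run, so accuracy and helpfulness can be compared with and without it.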

Model Sweep Results

Running the same eval harness across 5 models showed that Gemma 4 26B scored 7.88 vs. the original Gemini 3.1 Flash Lite Preview’s 7.33, and cost 75% less per session. Mistral Small 3.2 was a close second. Nova Micro was the cheapest, but its terse responses were penalized for not being actionable. Overall quality improved from 6.62 to 7.88 (+19%) and cost dropped from $0.002420 to $0.000509 per session (−79%).
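The headline percentages follow directly from the before and after figures quoted above. A quick check, using only those numbers:

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


quality = pct_change(6.62, 7.88)       # overall quality score
cost = pct_change(0.002420, 0.000509)  # dollars per session

print(f"quality: {quality:+.0f}%")  # quality: +19%
print(f"cost: {cost:+.0f}%")        # cost: -79%
```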

The entire evaluation was done using Neo AI Engineer, which built the eval harness, handled checkpointed runs, dealt with timeout and context limit issues, and consolidated results. The author reviewed everything manually.

📖 Read the full source: r/LocalLLaMA