EvalShift: Open-source CLI for detecting LLM regressions during model migration

✍️ OpenClawRadar | 📅 Published: May 15, 2026 | 🔗 Source

EvalShift is an open-source Python CLI designed to detect regressions when switching between LLMs or model versions. It runs your golden input suite against both source and target models, evaluates outputs, and produces a local HTML report — no backend, accounts, or telemetry.
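
The run-both-and-compare workflow described above can be sketched in a few lines of Python. This is an illustrative sketch, not EvalShift's actual code or file format: the JSONL field names (`id`, `input`, `pattern`, `tags`) and the helpers `load_suite`, `regex_evaluator`, and `compare_models` are hypothetical, and the two models are passed in as plain callables so the example runs offline.

```python
import json
import re
from typing import Callable, Iterable


def load_suite(lines: Iterable[str]) -> list[dict]:
    """Parse a JSONL golden suite: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def regex_evaluator(case: dict, output: str) -> float:
    """Structural check: 1.0 if the case's pattern occurs in the output, else 0.0."""
    return 1.0 if re.search(case["pattern"], output) else 0.0


def compare_models(
    suite: list[dict],
    source: Callable[[str], str],
    target: Callable[[str], str],
    evaluate: Callable[[dict, str], float],
) -> list[dict]:
    """Run every golden case through both models and score each output."""
    rows = []
    for case in suite:
        rows.append({
            "id": case["id"],
            "tags": case.get("tags", []),  # hypothetical field, used for slicing
            "source_score": evaluate(case, source(case["input"])),
            "target_score": evaluate(case, target(case["input"])),
        })
    return rows


# Stand-in models: real callables would wrap a provider call instead.
suite = load_suite([
    '{"id": "order-1", "input": "Find my order", "pattern": "[0-9]+", "tags": ["orders"]}',
])
rows = compare_models(
    suite,
    source=lambda prompt: "Your order is 123",
    target=lambda prompt: "I could not find it",
    evaluate=regex_evaluator,
)
```

In a real run the two callables would wrap a provider call, for example through LiteLLM, which the project is reported to use for model access. Keeping the scores paired per case is what makes the paired statistical tests in the feature list applicable.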

Key features

  • Source vs target model comparison via LiteLLM
  • JSONL golden suites with tags/slices
  • Structural evaluators: JSON schema, regex, length
  • Semantic evaluator: embedding similarity
  • LLM-as-judge pairwise evaluation
  • Tool-call evaluators: tool selection, argument matching, trace structure
  • Paired statistical tests: t-test / Wilcoxon
  • Effect sizes: Cohen's d
  • Multiple-comparison correction: Benjamini-Hochberg
  • Slice-level breakdowns
  • Local caching to control cost
  • Resumable runs
  • Single-file HTML report + JSON output
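
Two of the statistical features listed above, effect sizes and Benjamini-Hochberg correction, are compact enough to sketch with the Python standard library. This is a minimal illustration under stated assumptions: the function names are hypothetical, the effect size shown is the paired variant (mean difference divided by the standard deviation of the differences), and the source does not say which Cohen's d variant EvalShift computes. The p-values are taken as given; in practice they would come from a paired t-test or Wilcoxon test run per slice or per evaluator.

```python
import math
import statistics


def paired_cohens_d(source_scores: list[float], target_scores: list[float]) -> float:
    """Paired effect size: mean(target - source) / stdev(target - source)."""
    if len(source_scores) != len(target_scores) or len(source_scores) < 2:
        raise ValueError("need at least two paired scores of equal length")
    diffs = [t - s for s, t in zip(source_scores, target_scores)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        # All differences identical: no spread to normalize by.
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / sd


def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Negative d means the target model scored lower than the source.
d = paired_cohens_d([1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 1.0, 0.0])
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Correcting across slices matters because a report with many tags runs many tests at once, and some will look significant by chance alone.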

The project's narrow goal is migration safety: “Can I switch models without breaking my prompt/agent behavior?” The author emphasizes catching silent agent regressions — e.g., a newer model producing a decent-looking final answer but skipping a required tool call, calling the wrong tool, or mutating arguments.
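
The three silent failures named above (a skipped call, the wrong tool, mutated arguments) can be caught by comparing tool-call traces step by step. The sketch below is a hypothetical illustration of that idea, not EvalShift's evaluator: the trace shape (a list of dicts with `name` and `arguments`) and the function `diff_tool_calls` are assumptions made for the example.

```python
def diff_tool_calls(expected: list[dict], actual: list[dict]) -> list[str]:
    """Compare two ordered tool-call traces and describe each mismatch."""
    issues = []
    for i, exp in enumerate(expected):
        if i >= len(actual):
            issues.append(f"step {i}: missing call to {exp['name']}")
            continue
        act = actual[i]
        if act["name"] != exp["name"]:
            issues.append(f"step {i}: expected tool {exp['name']}, got {act['name']}")
        elif act["arguments"] != exp["arguments"]:
            issues.append(f"step {i}: arguments changed for {exp['name']}")
    # Anything the target model called beyond the expected trace.
    for j in range(len(expected), len(actual)):
        issues.append(f"step {j}: unexpected call to {actual[j]['name']}")
    return issues


# Hypothetical golden trace recorded from the source model.
golden = [
    {"name": "lookup_order", "arguments": {"order_id": "123"}},
    {"name": "issue_refund", "arguments": {"order_id": "123", "amount": 20}},
]
# Target model answers plausibly but never issues the refund.
skipped = diff_tool_calls(golden, golden[:1])
# Target model issues the refund with a different amount.
mutated = diff_tool_calls(golden, [
    golden[0],
    {"name": "issue_refund", "arguments": {"order_id": "123", "amount": 200}},
])
```

A final-answer check alone would pass both of these cases, which is why trace-level comparison is treated as a separate evaluator type.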

Use cases

  • Claude 4.5 → Claude 5
  • GPT-5 → GPT-6
  • Gemini 2 → Gemini 3
  • Local model → hosted model

The author is seeking feedback on three points: how useful the tool is for local versus hosted models, which evaluator types matter most for local LLM workflows, and whether tool-call and structured-output regressions are a real pain point. The repo is MIT-licensed.

📖 Read the full source: r/LocalLLaMA
