EvalShift: Open-source CLI for detecting LLM regressions during model migration

EvalShift is an open-source Python CLI for detecting regressions when you switch between LLMs or model versions. It runs your golden input suite against both the source and target models, evaluates the outputs, and produces a local HTML report, with no backend, accounts, or telemetry.
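The paired-run idea can be sketched in a few lines of Python. This is an illustration only, not EvalShift's actual code: the JSONL field names (`id`, `input`, `tags`), the `run_paired` helper, and the stand-in model functions are all hypothetical. In a real run the two callables would wrap a provider call, for example through LiteLLM.

```python
import json
from typing import Callable

# Hypothetical golden suite in JSONL form; the field names are assumptions.
SUITE_JSONL = """\
{"id": "case-1", "input": "Return the user as JSON", "tags": ["json"]}
{"id": "case-2", "input": "Return the order as JSON", "tags": ["json", "orders"]}
"""


def is_valid_json(text: str) -> bool:
    """Structural check: does the model output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def run_paired(suite_jsonl: str,
               source: Callable[[str], str],
               target: Callable[[str], str]) -> list[dict]:
    """Run every case through both models and record a pass/fail pair."""
    results = []
    for line in suite_jsonl.splitlines():
        if not line.strip():
            continue
        case = json.loads(line)
        results.append({
            "id": case["id"],
            "tags": case.get("tags", []),
            "source_ok": is_valid_json(source(case["input"])),
            "target_ok": is_valid_json(target(case["input"])),
        })
    return results


# Stand-in models so the sketch runs offline (no API calls are made).
def fake_source(prompt: str) -> str:
    return '{"name": "Ada"}'


def fake_target(prompt: str) -> str:
    return 'Sure! Here is the JSON: {"name": "Ada"}'


results = run_paired(SUITE_JSONL, fake_source, fake_target)
# A regression here means: the source passed the check but the target did not.
regressions = [r["id"] for r in results if r["source_ok"] and not r["target_ok"]]
print(regressions)
```

Keeping the per-case results paired by `id` is what makes the later paired statistics and slice-level breakdowns possible.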
Key features
- Source vs target model comparison via LiteLLM
- JSONL golden suites with tags/slices
- Structural evaluators: JSON schema, regex, length
- Semantic evaluator: embedding similarity
- LLM-as-judge pairwise evaluation
- Tool-call evaluators: tool selection, argument matching, trace structure
- Paired statistical tests: t-test / Wilcoxon
- Effect sizes: Cohen's d
- Multiple-comparison correction: Benjamini-Hochberg
- Slice-level breakdowns
- Local caching to control cost
- Resumable runs
- Single-file HTML report + JSON output
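To make the statistics items in the list above concrete, here is a minimal standard-library sketch of a paired effect size and the Benjamini-Hochberg procedure. It is not taken from EvalShift: the function names are hypothetical, and the source does not say which variant of Cohen's d the tool reports, so the paired form (mean difference divided by the standard deviation of the differences) is an assumption. The p-values would come from a paired t-test or Wilcoxon test, which are not reimplemented here.

```python
from statistics import mean, stdev


def paired_cohens_d(source_scores: list[float], target_scores: list[float]) -> float:
    """Paired effect size: mean of (target - source) over the sd of those differences."""
    diffs = [t - s for s, t in zip(source_scores, target_scores)]
    sd = stdev(diffs)
    if sd == 0:
        # No variation in the differences: the ratio is undefined.
        # Returning 0.0 is a simplification chosen for this sketch.
        return 0.0
    return mean(diffs) / sd


def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return one flag per p-value: True if rejected at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Illustrative numbers only: one p-value per evaluator or slice.
flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.20])
effect = paired_cohens_d([0.8, 0.9, 0.7, 0.85], [0.7, 0.8, 0.6, 0.8])
print(flags, round(effect, 2))
```

The correction matters because a report with many evaluators and slices runs many tests at once; without it, some "significant" regressions would be expected by chance alone.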
The project's narrow goal is migration safety: “Can I switch models without breaking my prompt/agent behavior?” The author emphasizes catching silent agent regressions, for example a newer model that produces a decent-looking final answer but skips a required tool call, calls the wrong tool, or changes the arguments it passes.
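The three failure modes named above (a skipped call, the wrong tool, changed arguments) can be illustrated with a small trace comparison. This is a sketch under stated assumptions rather than EvalShift's evaluator: the `ToolCall` type, the `diff_tool_traces` function, and the example tool names are hypothetical, and the source model's trace is treated as the reference.

```python
from dataclasses import dataclass, field

_MISSING = object()  # sentinel so an absent key differs from an explicit None


@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)


def diff_tool_traces(source: list[ToolCall], target: list[ToolCall]) -> list[str]:
    """Compare two ordered tool-call traces position by position."""
    findings = []
    for i, src in enumerate(source):
        if i >= len(target):
            findings.append(f"missing call #{i}: {src.name}")
            continue
        tgt = target[i]
        if tgt.name != src.name:
            findings.append(f"wrong tool at #{i}: expected {src.name}, got {tgt.name}")
        elif tgt.arguments != src.arguments:
            keys = set(src.arguments) | set(tgt.arguments)
            changed = sorted(
                k for k in keys
                if src.arguments.get(k, _MISSING) != tgt.arguments.get(k, _MISSING)
            )
            findings.append(f"argument change at #{i} ({src.name}): {', '.join(changed)}")
    # Calls the target made beyond the length of the reference trace.
    for extra in target[len(source):]:
        findings.append(f"extra call: {extra.name}")
    return findings


source_trace = [
    ToolCall("lookup_order", {"order_id": "A17"}),
    ToolCall("issue_refund", {"order_id": "A17", "amount": 20}),
]
target_trace = [
    ToolCall("lookup_order", {"order_id": "A17"}),
    ToolCall("issue_refund", {"order_id": "A17", "amount": 200}),
]
findings = diff_tool_traces(source_trace, target_trace)
print(findings)
```

A check like this is independent of the final answer text, which is why it can flag a regression even when the target model's reply looks fine.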
Use cases
- Claude 4.5 → Claude 5
- GPT-5 → GPT-6
- Gemini 2 → 3
- Local model → hosted model
The author is seeking feedback on three points: how useful the tool is for local versus hosted models, which evaluator types matter most for local LLM workflows, and whether tool-call and structured-output regressions are a real pain point. The repo is MIT licensed.
📖 Read the full source: r/LocalLLaMA