Arena AI Model ELO History Tracks LLM Performance Decay Over Time

Erwin Mayer's Arena AI Model ELO History (live tracker) plots historical ELO ratings from the LMSYS Arena leaderboard to expose performance trends of flagship AI models. The core insight: models that feel great at launch often degrade weeks later due to silent updates, quantization, or safety wrapper changes.
Key Features
- One curve per lab: Instead of a spaghetti chart of every variant, each major AI lab gets a single continuous line representing their highest-rated flagship model at any point in time.
- Flagship tracking logic: The curve sticks to the top-tier model (e.g., Opus stays active until a new higher-scoring model appears). Mid-tier releases like Sonnet don't cause a jump while Opus leads.
- Inference modes merged: Suffixes like -thinking, -reasoning, and -high are collapsed under the base model to avoid flip-flopping.
- New release markers: Releases are shown as labeled points, typically accompanied by score jumps.
- Degradation visible: Downward trends within a model's lifecycle between releases are clearly plotted.
- Mobile-friendly + dark mode included.
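
The suffix merging and flagship tracking described in the list above can be sketched roughly as follows. This is a minimal illustration, not the tracker's actual code: the `Snapshot` record, the function names, the suffix list, and the rule "switch only when another model strictly outscores the current flagship on that day" are all assumptions made for the example, and the model names are invented.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

# Assumed set of inference-mode suffixes; the real tracker may handle more.
MODE_SUFFIX = re.compile(r"-(thinking|reasoning|high)$")


def base_model(name: str) -> str:
    """Strip trailing mode suffixes, e.g. 'x-thinking-high' -> 'x'."""
    previous = None
    while previous != name:
        previous = name
        name = MODE_SUFFIX.sub("", name)
    return name


@dataclass(frozen=True)
class Snapshot:
    """One leaderboard row on one day (hypothetical record shape)."""
    day: date
    lab: str
    model: str
    rating: float


def flagship_curve(snapshots):
    """Build one curve per lab as a list of (day, base_model, rating)."""
    # Collapse variants: keep the best rating per (lab, day, base model).
    by_day = defaultdict(dict)
    for s in snapshots:
        ratings = by_day[(s.lab, s.day)]
        name = base_model(s.model)
        ratings[name] = max(ratings.get(name, float("-inf")), s.rating)

    curves = defaultdict(list)
    current = {}  # lab -> model currently held as flagship
    for lab, day in sorted(by_day):
        ratings = by_day[(lab, day)]
        best = max(ratings, key=ratings.get)
        held = current.get(lab)
        # Stay on the held flagship unless it vanished or is strictly beaten.
        if held not in ratings or ratings[best] > ratings[held]:
            held = best
        current[lab] = held
        curves[lab].append((day, held, ratings[held]))
    return dict(curves)


def drift_by_model(curve):
    """Rating change (last minus first) over each model's time as flagship."""
    first, last = {}, {}
    for _, model, rating in curve:
        first.setdefault(model, rating)
        last[model] = rating
    return {model: last[model] - first[model] for model in first}


# Invented sample data for illustration only.
SAMPLE = [
    Snapshot(date(2025, 1, 1), "LabA", "alpha-large", 1400.0),
    Snapshot(date(2025, 1, 1), "LabA", "alpha-mid", 1380.0),
    Snapshot(date(2025, 1, 8), "LabA", "alpha-large-thinking", 1395.0),
    Snapshot(date(2025, 1, 8), "LabA", "alpha-large", 1390.0),
    Snapshot(date(2025, 1, 8), "LabA", "alpha-mid", 1385.0),
    Snapshot(date(2025, 1, 15), "LabA", "alpha-large-2", 1450.0),
    Snapshot(date(2025, 1, 15), "LabA", "alpha-large", 1388.0),
]

curves = flagship_curve(SAMPLE)
drift = drift_by_model(curves["LabA"])
```

In this sample the mid-tier model never takes over the curve, the `-thinking` variant is folded into its base model, and `drift` reports a negative value for the first flagship, which is the kind of in-lifecycle decline the chart is meant to surface.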
Data Source
Data is fetched automatically each day from the official LMSYS Arena Dataset on Hugging Face. Arena ratings come from thousands of blind, crowdsourced human evaluations of models served through API endpoints, not consumer web UIs.
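
For readers unfamiliar with how pairwise votes become a rating, here is a sketch of the textbook online Elo update. It is shown only as background: the leaderboard's actual fitting procedure may differ from this, and the function names and the `k=32.0` step size are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one blind comparison.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models; A wins one vote.
new_a, new_b = elo_update(1000.0, 1000.0, 1.0)
```

With equal starting ratings the expected score is 0.5, so a single win moves each model by half of `k` in opposite directions.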
Critical Blindspot: Web UI vs. API
The author acknowledges a key limitation: LMSYS tests raw API models. Consumer interfaces (chatgpt.com, gemini.google.com) add heavy system prompts and safety wrappers, and may silently switch to quantized models under load. The project is looking for historical ELO or evaluation datasets from actual web UIs to capture the "nerfing" that users experience. PRs with such datasets are welcome (repo link in footer).
Who It’s For
Developers and researchers tracking LLM quality over time, especially those deploying AI agents that rely on consistent model behavior.
📖 Read the full source: HN LLM Tools