AI Model ELO History: Track LLM Performance Decay

Erwin Mayer's Arena AI Model ELO History (live tracker) plots historical Elo ratings from the LMSYS Arena leaderboard to expose performance trends in each lab's flagship AI models. The core insight: models that feel great at launch can score lower weeks later, which the project attributes to silent updates, quantization, or safety-wrapper changes.

Key Features

One curve per lab: Instead of a spaghetti chart of every variant, each major AI lab gets a single continuous line representing their highest-rated flagship model at any point in time.
Flagship tracking logic: The curve sticks to the top-tier model (e.g., Opus stays active until a new higher-scoring model appears). Mid-tier releases like Sonnet don't cause a jump while Opus leads.
Inference modes merged: Suffixes like -thinking, -reasoning, -high are collapsed under the base model to avoid flip-flopping.
New release markers: Releases are shown as labeled points, typically accompanied by score jumps.
Degradation visible: Downward trends within a model's lifecycle between releases are clearly plotted.
Mobile-friendly layout: The chart works on small screens, and dark mode is included.
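
The flagship-tracking and mode-merging rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the model names, the snapshot layout, the helper names (base_model, flagship_curve, lifecycle_change) and the exact switching rule (switch whenever another model scores strictly higher) are all hypothetical.

```python
import re

# Assumed suffix list: the text names -thinking, -reasoning and -high as examples.
MODE_SUFFIX = re.compile(r"(?:-(?:thinking|reasoning|high))+$")


def base_model(name):
    """Collapse inference-mode suffixes onto the base model name."""
    return MODE_SUFFIX.sub("", name)


def flagship_curve(snapshots):
    """Build one curve for ONE lab.

    snapshots: list of (date, {model_name: score}), sorted by date.
    Returns a list of (date, flagship, score, is_release_marker).
    """
    curve = []
    flagship = None
    for date, scores in snapshots:
        # Merge inference modes: keep the best score per base model.
        merged = {}
        for name, score in scores.items():
            base = base_model(name)
            merged[base] = max(score, merged.get(base, float("-inf")))
        if not merged:
            continue
        best = max(merged, key=merged.get)
        switched = False
        if flagship is None or flagship not in merged:
            flagship, switched = best, True
        elif best != flagship and merged[best] > merged[flagship]:
            # Assumption: only a strictly higher score displaces the flagship,
            # so a lower-scoring mid-tier release never moves the curve.
            flagship, switched = best, True
        curve.append((date, flagship, merged[flagship], switched))
    return curve


def lifecycle_change(curve):
    """Score change from a flagship's first to its last point on the curve."""
    first, last = {}, {}
    for _, model, score, _ in curve:
        first.setdefault(model, score)
        last[model] = score
    return {model: last[model] - first[model] for model in first}


# Made-up data for a single hypothetical lab.
snapshots = [
    ("2025-01-01", {"flagship-1": 1400, "mid-1": 1380}),
    ("2025-02-01", {"flagship-1": 1390, "flagship-1-thinking": 1395, "mid-2": 1392}),
    ("2025-03-01", {"flagship-1": 1385, "flagship-2-thinking": 1450}),
]
curve = flagship_curve(snapshots)
changes = lifecycle_change(curve)
```

With this made-up data the curve stays on flagship-1 through February (mid-2 never outscores it), records a 5-point decline over that lifecycle, and jumps to flagship-2 in March with a release marker.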

Data Source

Data is fetched automatically every day from the official LMSYS Arena Dataset on Hugging Face. Arena ratings come from thousands of blind, crowdsourced human evaluations of models served through API endpoints, not through consumer web UIs.
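
For readers unfamiliar with how blind pairwise votes become a rating, here is a minimal sketch of the classic Elo update. It is shown only as background: the Arena's published scores come from its own statistical fit, which may differ from this simple online update, and the K-factor and function names are illustrative assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, outcome_a, k=32.0):
    """Return updated (rating_a, rating_b) after one blind vote.

    outcome_a: 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    k: step size (assumed value; real leaderboards tune this or avoid it).
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models; the voter prefers A.
new_a, new_b = elo_update(1000.0, 1000.0, 1.0)
```

Because the update is zero-sum, a model's rating can drift down between releases purely because newer competitors start winning more of the head-to-head votes.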

Critical Blindspot: Web UI vs. API

The author acknowledges a key limitation: the LMSYS Arena evaluates models through their raw APIs. Consumer interfaces (chatgpt.com, gemini.google.com) add heavy system prompts and safety wrappers, and may silently switch to quantized models under load. The project is therefore looking for historical Elo or other evaluation datasets collected from the actual web UIs, to capture the "nerfing" that users experience. PRs with such datasets are welcome (repo link in footer).