HeadlinesBriefing.com

Tracking AI Model Performance Decay Over Time

Hacker News

A developer built a live dashboard tracking the performance lifecycle of flagship AI models by visualizing historical Elo ratings from LMSYS Arena. The tool plots exactly one continuous curve per major AI lab, dynamically tracking their highest-rated flagship model over time rather than every variant. This approach makes generational jumps and performance decay far easier to spot.
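The "one curve per lab" logic amounts to taking, at each point in time, only the top-rated model from each lab. A minimal sketch of that selection, using hypothetical labs, model names, and ratings (the project's actual schema and data are not shown in the post):

```python
from collections import defaultdict

# Hypothetical snapshot rows: (date, lab, model, elo).
# Names and ratings are illustrative, not real Arena data.
snapshots = [
    ("2024-01", "LabA", "model-a-1", 1200),
    ("2024-01", "LabA", "model-a-mini", 1150),
    ("2024-01", "LabB", "model-b-1", 1210),
    ("2024-02", "LabA", "model-a-2", 1260),
    ("2024-02", "LabB", "model-b-1", 1205),
]

def flagship_curves(rows):
    """For each lab, keep only its highest-rated model per date,
    yielding one continuous curve per lab."""
    best = defaultdict(dict)  # lab -> date -> (elo, model)
    for date, lab, model, elo in rows:
        current = best[lab].get(date)
        if current is None or elo > current[0]:
            best[lab][date] = (elo, model)
    # Sort each lab's points chronologically for plotting.
    return {lab: sorted(points.items()) for lab, points in best.items()}

curves = flagship_curves(snapshots)
# LabA's curve follows model-a-1 in 2024-01, then jumps to model-a-2,
# ignoring the lower-rated model-a-mini entirely.
```

A generational jump then shows up as the curve switching which model it tracks, rather than a new line appearing.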

The dashboard pulls data daily from the official LM Arena Leaderboard on Hugging Face, which relies on thousands of blind, crowdsourced human evaluations. Each curve tracks the lab's top-performing flagship at any given moment—mid-tier models get ignored if a higher-tier model is still leading. Inference-mode variants with suffixes like -thinking or -reasoning get merged to prevent the chart from flipping between them.
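The suffix-merging step described above can be sketched as a simple name normalization; the suffix list and model names here are assumptions for illustration, not the project's actual rules:

```python
import re

# Assumed inference-mode suffixes to merge; the project may handle more.
SUFFIXES = re.compile(r"-(thinking|reasoning)$")

def base_name(model: str) -> str:
    """Strip inference-mode suffixes so variants like 'model-x-thinking'
    collapse into the same series as 'model-x', preventing the chart
    from flipping between them."""
    return SUFFIXES.sub("", model)
```

With this normalization, the per-lab top-model selection operates on base names, so a `-thinking` variant briefly outscoring its sibling doesn't register as a model change.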

The developer acknowledges a significant blind spot: Arena tests raw API endpoints, but consumer chat interfaces often layer on system prompts and safety filters, and may silently switch to quantized models under load. This "nerfing" that everyday users experience doesn't show up in API benchmarks. The project is open-source, and the creator is seeking datasets that specifically test consumer web UIs rather than raw APIs.