HeadlinesBriefing.com

LLM Battle Royale Reveals Hidden Model Performance Gaps

Hacker News

Developer Jacky dropped eleven large language models into a custom 2D battle royale arena, where they fought thirty matches with randomized starting positions and shrinking play zones. Each model could edit its own personality files between games, so its strategy could adapt from match to match rather than follow a fixed script. The experiment ran through OpenRouter, with full tool access for each agent.

Grok 4.1 Fast dominated, winning 13 matches at just $0.97 per victory, while Claude Sonnet 4.6 secured only five wins at $26.78 each, about 27.6 times as much per win. Three models, including GPT 5.4-mini, spent a combined $57 without winning a single match. GPT 5.4 recorded the most kills (38) but finished second with just two victories. Traditional benchmark rankings did not predict these outcomes.
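The cost gap quoted above can be checked with a few lines of arithmetic. The following is a minimal Python sketch using only the win counts and per-win costs reported in this brief. The `ModelResult` class and `cost_ratio` helper are hypothetical names introduced for illustration, and the implied total spend assumes that "cost per victory" means total spend divided by wins, which the source does not state.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    """Per-model figures as reported in the brief (hypothetical container)."""
    name: str
    wins: int
    cost_per_win: float  # USD per victory, as quoted in the article

    @property
    def implied_total_spend(self) -> float:
        # Assumption: cost per win = total spend / wins, so the total
        # is recovered by multiplying back. Not confirmed by the source.
        return self.wins * self.cost_per_win


def cost_ratio(expensive: ModelResult, cheap: ModelResult) -> float:
    """How many times more one model paid per win than another."""
    return expensive.cost_per_win / cheap.cost_per_win


grok = ModelResult("Grok 4.1 Fast", wins=13, cost_per_win=0.97)
claude = ModelResult("Claude Sonnet 4.6", wins=5, cost_per_win=26.78)

ratio = cost_ratio(claude, grok)
print(f"Cost-per-win ratio: {ratio:.1f}x")
print(f"Implied spend, {grok.name}: ${grok.implied_total_spend:.2f}")
print(f"Implied spend, {claude.name}: ${claude.implied_total_spend:.2f}")
```

Under that assumption, the ratio comes out near 27.6, which is consistent with the roughly 27x figure quoted in the brief.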

The real revelation was behavioral divergence. Claude consistently proposed truces, shared its location, and sought alliances before combat. Its soul file reflected training on cooperative dialogue, which proved fatal in a zero-sum scenario. Grok took the opposite approach: aggressive but calculated tactics, with meticulous tracking of hit probability and damage. It mastered car-ramming strategies within weeks.

The experiment suggests that model alignment can create performance trade-offs that standard evaluations do not capture. Benchmarks measure capability but miss the personality traits that shape real-world effectiveness. For developers choosing models, these behavioral patterns may matter more than raw benchmark scores.