HeadlinesBriefing.com

AI Code Battle: LLM Skirmish Benchmark

Hacker News

LLM Skirmish is a benchmark in which language models compete in 1v1 real-time strategy games by writing code that runs in a live game environment. The project addresses a disconnect: frontier LLMs can one-shot complex coding projects, yet they struggle with simple game navigation. By borrowing the Screeps paradigm of code-controlled gameplay, the benchmark plays to LLMs' coding strengths rather than their weaknesses in traditional game scenarios.

Testing showed Claude Opus 4.5 as the dominant model with an 85% win rate, though it was vulnerable in the early game because it focused too heavily on its economy. GPT 5.2 ranked second with a 68% win rate and delivered nearly 1.7x more Elo per dollar than Claude. The evaluation used OpenCode, an open-source coding harness that runs agents in isolated Docker containers, which keeps the competition fair and the setup transparent.
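For readers unfamiliar with the rating behind the "Elo per dollar" figure, here is a minimal sketch of the standard Elo update in Python. The summary does not say which Elo variant, K-factor, or cost accounting the benchmark uses, so the K-factor of 32, the function names, and the simple rating-divided-by-cost ratio are all assumptions made for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both players' new ratings after one match.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k=32 is an assumed K-factor, not a value taken from the benchmark.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def elo_per_dollar(rating: float, cost_usd: float) -> float:
    """Hypothetical cost-efficiency metric: rating divided by total spend."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return rating / cost_usd
```

Under this definition, a cheaper model can come out ahead on cost efficiency even when its raw rating is lower, which is the kind of trade-off the GPT 5.2 versus Claude comparison points to.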

The tournament tests in-context learning across five rounds, with models adapting their strategies based on previous results. Most models improved as the rounds progressed; Gemini 3 Pro was the exception, with strong early performance followed by later struggles. The project also offers a community ladder for submitting strategies and a visualizer for match playback, so developers can test and compare LLM capabilities in a structured competitive environment.
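One way to check whether a model improves from round to round is to tally its win rate per round. The sketch below assumes match results are available as `(round_number, won)` pairs; that record shape and the function name are hypothetical and not taken from the project.

```python
from collections import defaultdict


def win_rate_by_round(matches):
    """Compute the win rate for each round.

    matches: iterable of (round_number, won) pairs, where won is a bool.
    Returns a dict mapping round number to win rate, ordered by round.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for round_no, won in matches:
        games[round_no] += 1
        wins[round_no] += int(won)
    return {r: wins[r] / games[r] for r in sorted(games)}
```

A rising sequence of per-round values would match the in-context learning pattern described above, while a high first value followed by lower ones would match the pattern reported for Gemini 3 Pro.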