HeadlinesBriefing.com

LLMs Face Magic: Benchmark Reveals Skill Gaps in Card Game Play

Hacker News

A new benchmark tests large language models on Magic: The Gathering by letting them simulate turns without a rules engine. The test shows models like Gemini 3.5 can orchestrate complex scry and tutor combos, while Opus 4.8 and GPT‑5.5 stumble on card return logic.

The benchmark uses an MCP server to expose primitive operations on the library, the player's deck, as tool calls (draw, shuffle, scry), while higher‑level logic stays in the model. Costs surge when models over‑call tools: a single 10k‑token prompt can inflate charges to 110k tokens in a tight loop.
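To make the division of labor concrete, here is a minimal Python sketch of the kind of primitives such a server might expose. It does not use any MCP SDK and is not the benchmark's code: the `Library` class, its method names, and the `billed_input_tokens` helper are all hypothetical. The cost helper assumes the full prompt is re-sent on every tool round trip and ignores tool output, which is only one way a 10k-token prompt could reach 110k billed tokens; the article does not say how the figure arises.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Library:
    """A player's deck exposing only primitive operations (hypothetical API)."""
    cards: list[str]  # index 0 is the top of the library
    rng: random.Random = field(default_factory=random.Random)

    def draw(self, n: int = 1) -> list[str]:
        """Remove and return the top n cards."""
        if n > len(self.cards):
            raise ValueError("not enough cards in library")
        drawn, self.cards = self.cards[:n], self.cards[n:]
        return drawn

    def shuffle(self) -> None:
        """Randomize the order of the library in place."""
        self.rng.shuffle(self.cards)

    def scry(self, n: int) -> list[str]:
        """Look at the top n cards without moving them."""
        return self.cards[:n]

    def reorder(self, top: list[str], bottom: list[str]) -> None:
        """Finish a scry: keep `top` on top, send `bottom` to the bottom."""
        n = len(top) + len(bottom)
        if sorted(top + bottom) != sorted(self.cards[:n]):
            raise ValueError("cards do not match the top of the library")
        self.cards = top + self.cards[n:] + bottom


def billed_input_tokens(prompt_tokens: int, tool_calls: int) -> int:
    """Assumed cost model: the whole prompt is re-sent once per round trip."""
    return prompt_tokens * (tool_calls + 1)
```

Under these assumptions, the model decides which cards to keep or bottom, and the server only moves them. With the assumed cost model, `billed_input_tokens(10_000, 10)` comes to 110,000: ten tool calls after the initial request.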

The evaluation shows that models excel at judging legality but falter when executing moves. Errors such as forgetting to return exiled cards or self‑correcting after a tool call expose gaps in state management. The project demonstrates that a robust simulation needs more than a language model.
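The exiled-card failure is the kind of bookkeeping that a small piece of explicit state can take off the model's hands. The sketch below is an illustration, not part of the benchmark: `ExileTracker` and its methods are hypothetical names, and it covers only the simple case where a permanent holds cards in exile until it leaves the battlefield.

```python
from collections import defaultdict


class ExileTracker:
    """Remembers which source exiled which cards so they can be returned."""

    def __init__(self) -> None:
        self._by_source: dict[str, list[str]] = defaultdict(list)

    def exile(self, source: str, card: str) -> None:
        """Record that `source` is holding `card` in exile."""
        self._by_source[source].append(card)

    def source_left(self, source: str) -> list[str]:
        """Stop tracking `source` and return every card it was holding."""
        return self._by_source.pop(source, [])

    def pending(self) -> dict[str, list[str]]:
        """Snapshot of the cards still waiting to be returned, by source."""
        return {s: list(cards) for s, cards in self._by_source.items() if cards}
```

For example, after `tracker.exile("Banisher Priest", "Grizzly Bears")`, a later call to `tracker.source_left("Banisher Priest")` returns the list of cards to put back, so the return no longer depends on the model remembering it.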

Ultimately, the study warns developers that current LLMs lack the precision to replace rules engines for complex games. A dedicated engine remains essential for reliable play.