HeadlinesBriefing.com

LLMs Fail to Model Real Systems in TLA+ — Benchmark Reveals 46% Conformance Gap

Hacker News

The Specula team built SysMoBench, a benchmark that tests whether LLMs can actually model real computing systems in TLA+ or merely recite textbook specifications. The investigation began when the team asked Claude to write a TLA+ spec for Etcd's Raft implementation, only to discover that it reproduced the spec from the Raft paper's appendix rather than modeling the Etcd-specific code.

The benchmark evaluates eleven distributed systems across four phases: syntax checking, runtime execution, conformance testing (comparing traces from real code against the model), and invariant verification. The results reveal a stark gap: leading LLMs, including Claude, GPT, and Gemini, achieve near-perfect syntax scores but only around 46% on conformance and 41% on invariant checks.
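The four-phase pipeline can be sketched roughly as follows. The phase names come from the article; the `SpecResult` record and the scoring function are illustrative inventions, not SysMoBench's actual code.

```python
from dataclasses import dataclass

# Hypothetical per-spec results for the four phases the article describes.
@dataclass
class SpecResult:
    syntax_ok: bool      # spec parses cleanly
    runs_ok: bool        # the model checker can execute the model
    conformance: float   # fraction of real-system transitions the spec accepts
    invariants: float    # fraction of invariant checks that pass

def phase_scores(results):
    """Aggregate pass rates across benchmarked systems, one score per phase."""
    n = len(results)
    return {
        "syntax": sum(r.syntax_ok for r in results) / n,
        "runtime": sum(r.runs_ok for r in results) / n,
        "conformance": sum(r.conformance for r in results) / n,
        "invariants": sum(r.invariants for r in results) / n,
    }
```

A near-perfect syntax score with a much lower conformance score, as in the article's numbers, would show up here as `scores["syntax"]` close to 1.0 while `scores["conformance"]` sits far below it.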

Two systematic failure patterns emerged. LLMs use textbook formalization templates that don't match actual system semantics; for example, they model ZooKeeper's recvset as a set union when the code actually uses a map that overwrites keys. They also merge multi-step code operations into single atomic guards, erasing intermediate states the real system passes through. Both patterns produce structurally correct specs that fail to model the actual implementations.
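Both failure patterns can be mirrored in a few lines of Python. Only the `recvset` name comes from the article; the functions and field names below are hypothetical stand-ins, not ZooKeeper's or the benchmark's code.

```python
# Pattern 1: textbook template vs. real semantics.
# Textbook-style formalization: recvset as a set of (server, vote) pairs,
# so every vote ever received accumulates.
def receive_vote_set(recvset, server, vote):
    return recvset | {(server, vote)}

# Closer to the real code: recvset as a map keyed by server, so a later
# vote from the same server overwrites the earlier one.
def receive_vote_map(recvset, server, vote):
    return {**recvset, server: vote}

# Pattern 2: atomicity merging.
# The real system passes through an observable intermediate state...
def two_step_trace(state):
    mid = {**state, "prepared": True}   # visible in real execution traces
    end = {**mid, "committed": True}
    return [state, mid, end]

# ...while a merged atomic spec action jumps straight to the end state.
def merged_atomic_trace(state):
    return [state, {**state, "prepared": True, "committed": True}]
```

If the same server votes twice, the set version keeps both entries while the map version keeps only the latest, so the two models reach different states. Likewise, the two traces share the same endpoints but differ in the intermediate state, which is exactly where conformance testing against real traces catches the merged spec.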

The key insight: syntax validation alone cannot distinguish "modeling Etcd" from "reciting the Raft paper." SysMoBench's Transition Validation approach cuts execution traces into pre-state, action, and post-state windows, then tests each transition independently against TLC to pinpoint exactly where specs diverge from real system behavior.
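A minimal sketch of the windowing idea, assuming a trace is a list of states with an action label between each consecutive pair. The real harness replays each window against TLC; a simple table of transition predicates stands in for that check here, and all names are illustrative.

```python
def transition_windows(states, actions):
    """Cut a trace into (pre-state, action, post-state) windows, one per transition."""
    assert len(states) == len(actions) + 1
    return [(states[i], actions[i], states[i + 1]) for i in range(len(actions))]

def check_conformance(states, actions, spec):
    """Test every transition independently and return the indices that diverge.

    `spec` maps an action name to a predicate over (pre, post), standing in
    for the model-checker query the article describes.
    """
    return [
        i
        for i, (pre, act, post) in enumerate(transition_windows(states, actions))
        if not spec[act](pre, post)
    ]
```

Because each window is checked on its own, one bad transition is reported by index instead of poisoning the whole trace, which is what lets the benchmark pinpoint exactly where a spec diverges from real system behavior.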