HeadlinesBriefing.com

SOB Benchmark Reveals Real Value Accuracy Gap in LLM Structured Output

Hacker News

SOB, a new Structured Output Benchmark, measures both JSON schema compliance and factual value accuracy across text, image, and audio. It pairs each record with a verified schema and ground‑truth answer, flagging hallucinated fields that pass schema checks but misrepresent data.
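To make the distinction concrete, here is a minimal sketch of an output that passes a schema check while carrying a wrong value. The schema, the invoice record, and the helper names `passes_schema` and `wrong_values` are all hypothetical illustrations, not taken from SOB itself.

```python
import json

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"invoice_id": str, "total": float, "currency": str}

# Hypothetical ground truth and model output for one record.
GROUND_TRUTH = {"invoice_id": "INV-1042", "total": 318.50, "currency": "EUR"}
MODEL_OUTPUT = '{"invoice_id": "INV-1042", "total": 381.50, "currency": "EUR"}'


def passes_schema(record, schema):
    """True if the record has exactly the schema's keys with matching types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[key], expected) for key, expected in schema.items()
    )


def wrong_values(record, truth):
    """Return the fields whose values differ from the ground truth."""
    return sorted(key for key in truth if record.get(key) != truth[key])


parsed = json.loads(MODEL_OUTPUT)
schema_ok = passes_schema(parsed, SCHEMA)       # structure and types are fine
mismatches = wrong_values(parsed, GROUND_TRUTH)  # but the total is transposed
print(schema_ok, mismatches)  # True ['total']
```

A schema-only check would accept this record; only the comparison against ground truth catches the transposed digits in `total`.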

The benchmark exposes a wide gap: top models score 95% or higher on JSON pass rate yet land 15–30 points lower on value accuracy. GLM‑4.7 leads on text, Gemma‑4‑31B on images, and Gemini‑2.5‑Flash on audio, so strengths vary by modality.

Metrics include Value Accuracy, JSON Pass Rate, Path Recall, Structure Coverage, Type Safety, Faithfulness, and Perfect Response. A response counts as perfect only when every leaf value is correct, and the Perfect Response rate hovers around 50% even for the best performers.
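Three of these metrics can be sketched over leaf paths. The definitions below are assumptions for illustration (leaf-level exact match, with ground-truth leaves as the denominator), not SOB's published formulas, and `flatten` and `score` are hypothetical helper names.

```python
def flatten(obj, prefix=""):
    """Map each leaf of nested dicts/lists to a dotted path, e.g. 'meds.0'."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    leaves = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        leaves.update(flatten(value, path))
    return leaves


def score(predicted, truth):
    """Assumed definitions: recall and accuracy over ground-truth leaf paths."""
    pred, gold = flatten(predicted), flatten(truth)
    present = [path for path in gold if path in pred]
    correct = [path for path in present if pred[path] == gold[path]]
    return {
        "path_recall": len(present) / len(gold),
        "value_accuracy": len(correct) / len(gold),
        "perfect_response": len(correct) == len(gold),
    }


# Hypothetical record: one wrong leaf (age) and one missing leaf (meds.1).
truth = {"patient": {"name": "Ana Ruiz", "age": 54}, "meds": ["aspirin", "metformin"]}
predicted = {"patient": {"name": "Ana Ruiz", "age": 45}, "meds": ["aspirin"]}

result = score(predicted, truth)
print(result)
```

Under these assumed definitions, a single wrong or missing leaf is enough to make `perfect_response` false, which is consistent with the perfect-response rate sitting well below the other scores.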

SOB’s scoring gates prevent inflated scores: a parse failure zeroes downstream metrics, and value accuracy counts only returned fields. This approach forces developers to move beyond schema checks and focus on data integrity in production workflows.
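The parse gate can be sketched as follows: an output that fails to parse earns zero on every downstream metric, so it drags the batch average down instead of being silently skipped. The record, the responses, and `score_response` are hypothetical, and the accuracy formula is an assumption rather than SOB's exact rule.

```python
import json

# Hypothetical ground truth for one flat record.
TRUTH = {"city": "Lyon", "population": 522250}


def score_response(raw, truth):
    """Return (json_pass, value_accuracy) with the parse gate applied."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0, 0.0  # gate: unparseable output earns nothing downstream
    if not isinstance(parsed, dict):
        return 1.0, 0.0  # valid JSON, but no fields to credit
    # Only fields the model actually returned can earn credit.
    correct = sum(1 for key, value in truth.items()
                  if key in parsed and parsed[key] == value)
    return 1.0, correct / len(truth)


responses = [
    '{"city": "Lyon", "population": 522250}',  # fully correct
    '{"city": "Lyon", "population": 500000}',  # parses, one wrong value
    '{"city": "Lyon", "population": 522250',   # truncated: fails to parse
]

scores = [score_response(raw, TRUTH) for raw in responses]
json_pass_rate = sum(s[0] for s in scores) / len(scores)
mean_value_accuracy = sum(s[1] for s in scores) / len(scores)
print(json_pass_rate, mean_value_accuracy)
```

Note that the truncated response contains only correct text up to the point where it stops, yet it contributes zero to value accuracy because it never clears the parse gate.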