HeadlinesBriefing.com

Google DeepMind Unveils FACTS Benchmark Suite to Measure LLM Accuracy

Google DeepMind Blog

Google DeepMind has launched the FACTS Benchmark Suite, a collection of four tests that probe large language models for accuracy in internal knowledge, web‑search synthesis, image grounding, and context‑based grounding. The suite builds on the original FACTS Grounding benchmark and adds 3,513 curated examples split between public and private sets.

The Parametric benchmark poses trivia‑style questions that must be answered from a model's internal knowledge alone, with 1,052 public and 1,052 private items. Search challenges models to synthesize multiple facts from the web, with 890 public and 994 private prompts. Multimodal poses image‑based questions, with 711 public and 811 private cases testing grounded responses to real‑world imagery.
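The per‑benchmark splits above can be captured in a simple lookup table (counts as reported here; structure is illustrative, not DeepMind's own data format):

```python
# Public/private example counts per FACTS benchmark, as reported above.
FACTS_SPLITS = {
    "Parametric": {"public": 1052, "private": 1052},
    "Search": {"public": 890, "private": 994},
    "Multimodal": {"public": 711, "private": 811},
}

# Example lookup: how many held-out (private) Search prompts are there?
print(FACTS_SPLITS["Search"]["private"])  # → 994
```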

In benchmark runs, Gemini 3 Pro tops the field with an overall FACTS Score of 68.8%. Its largest gains over Gemini 2.5 Pro come on the Search and Parametric benchmarks, where it cuts error rates by 55% and 35% respectively. Yet all 15 leading models score below 70%, underscoring how much room remains for improving factual accuracy before deployment in critical applications.
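An error‑rate reduction is relative: it measures how much of the remaining error disappears, not the raw accuracy gain. A minimal sketch of the arithmetic, using hypothetical accuracy figures (not the reported FACTS numbers):

```python
def relative_error_reduction(old_accuracy: float, new_accuracy: float) -> float:
    """Fraction of the old error rate eliminated by an accuracy improvement."""
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    return (old_error - new_error) / old_error

# Hypothetical example: moving from 60% to 82% accuracy removes
# (0.40 - 0.18) / 0.40 = 55% of the errors.
print(round(relative_error_reduction(0.60, 0.82), 2))  # → 0.55
```

This is why a model can post a 55% error‑rate reduction while its headline accuracy rises by a much smaller number of percentage points.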

The suite’s public leaderboard, managed by Kaggle, lets researchers compare models on held‑out data, fostering transparency. DeepMind’s technical report details the evaluation methodology and benchmark construction. By exposing weaknesses in multimodal and search‑based reasoning, the FACTS Benchmark Suite pushes the industry toward more reliable, trustworthy language tools.