HeadlinesBriefing.com

Google AI Explores Optimal Rater Counts for Reproducible AI Benchmarks

Google AI Blog

Reproducibility in machine learning depends on getting consistent results across repeated experiments, but evaluations that rely on human raters add a source of variation: people disagree. Google AI’s recent study, *Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation*, examines how the split between the number of items evaluated (N) and the number of human raters per item (K) affects the reliability of a benchmark. The research challenges the industry standard of 1–5 raters per item, arguing that 10 or more raters are often necessary to capture nuanced disagreements. Moving from a ‘breadth-first’ approach (many items, few raters each) to a ‘depth-first’ strategy (fewer items, each rated by many diverse raters) could change how AI models are tested.

The study employed a simulator built on real-world datasets, including the Toxicity dataset (17,280 raters) and DICES (123 raters across 16 safety dimensions), to stress-test different (N, K) configurations. By varying N (from 100 to 50,000 items) and K (from 1 to 500 raters per item), the researchers identified the best trade-offs for different evaluation goals. For metrics that prioritize majority consensus, broader item sampling (larger N) mattered more. For metrics that capture the diversity of opinion, more raters per item (larger K) were critical. The work highlights that even a modest budget of 1,000 total annotations (N × K) can yield reliable results if it is allocated strategically.
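
To make the budget trade-off concrete, here is a minimal Python sketch of the kind of experiment described above. It is not Google's simulator and does not use the Toxicity or DICES data. Everything in it is an assumption made for illustration: each item gets a hypothetical latent probability `p` (drawn uniformly) that a random rater labels it positive, and the function names `allocations`, `run_once`, and `rmse` are invented here. It splits a fixed budget of 1,000 annotations into different (N, K) pairs and measures how far two metrics land from their known true values across repeated runs: a majority-vote rate, and the share of "contested" items where 30% to 70% of raters say positive.

```python
import math
import random

# True values under the toy assumption p ~ Uniform(0, 1).
TRUE_MAJORITY_RATE = 0.5   # P(p > 0.5)
TRUE_CONTESTED_RATE = 0.4  # P(0.3 <= p <= 0.7)


def allocations(budget, k_values):
    """Return (N, K) pairs that spend the budget exactly: N * K == budget."""
    return [(budget // k, k) for k in k_values if budget % k == 0]


def run_once(n_items, k_raters, rng):
    """Simulate one evaluation; return (majority_rate, contested_rate)."""
    majority = 0
    contested = 0
    for _ in range(n_items):
        # Hypothetical item model: latent share of raters who would say "positive".
        p = rng.random()
        positives = sum(rng.random() < p for _ in range(k_raters))
        # Majority vote is positive when more than half of the raters agree.
        majority += 2 * positives > k_raters
        # Item looks contested when 30%..70% of its raters said "positive".
        contested += 3 * k_raters <= 10 * positives <= 7 * k_raters
    return majority / n_items, contested / n_items


def rmse(n_items, k_raters, reruns=200, seed=0):
    """Root-mean-square error of both metrics over repeated simulated runs."""
    rng = random.Random(seed)
    sq_major = 0.0
    sq_contested = 0.0
    for _ in range(reruns):
        m, c = run_once(n_items, k_raters, rng)
        sq_major += (m - TRUE_MAJORITY_RATE) ** 2
        sq_contested += (c - TRUE_CONTESTED_RATE) ** 2
    return math.sqrt(sq_major / reruns), math.sqrt(sq_contested / reruns)


if __name__ == "__main__":
    # Odd K values avoid ties in the majority vote.
    for n, k in allocations(1000, [1, 5, 25, 125]):
        m_err, c_err = rmse(n, k)
        print(f"N={n:5d} K={k:4d} majority RMSE={m_err:.3f} contested RMSE={c_err:.3f}")
```

In this toy model a single rater per item can never register a contested item, so the contested-rate estimate is always zero at K=1 no matter how many items are sampled, while the majority-vote error tends to grow as N shrinks. That mirrors the qualitative pattern reported in the study, but the numbers come from an invented item model and should not be read as the paper's results.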

The key findings argue against a ‘one size fits all’ rater count. While 3–5 raters often fail to reflect true human variability, adding raters without regard to the evaluation goal wastes budget. The study also explored skewed data (e.g., 99% spam emails) and multi-category labels, showing that the best allocation depends on the data and the metric. Collaborators from RIT, including Prof. Christopher Homan, emphasized that reproducibility isn’t just a technical matter; it also requires embracing human complexity.

This research urges practitioners to move beyond the ‘single truth’ paradigm in AI benchmarking. As AI ventures into subjective domains like ethics, understanding disagreement becomes as vital as consensus. Google AI’s open-sourced simulator offers a practical tool for building benchmarks that mirror real-world human perspectives, which can help models be evaluated more fairly and robustly.