The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

Aileen Cheng; Alon Jacovi; Amir Globerson; Ben Golan; Charles Kwong; Chris Alberti; Connie Tao; Eyal Ben-David; Gaurav Singh Tomar; Lukas Haas; Yonatan Bitton; Adam Bloniarz; Aijun Bai; Andrew Wang; Anfal Siddiqui; Arturo Bajuelos Castillo; Aviel Atias; Chang Liu; Corey Fry; Daniel Balle; Deepanway Ghosal; Doron Kukliansky; Dror Marcus; Elena Gribovskaya; Eran Ofek; Honglei Zhuang; Itay Laish; Jan Ackermann; Lily Wang; Meg Risdal; Megan Barnes; Michael Fink; Mohamed Amin; Moran Ambar; Natan Potikha; Nikita Gupta; Nitzan Katz; Noam Velan; Ofir Roval; Ori Ram; Polina Zablotskaia; Prathamesh Bang; Priyanka Agrawal; Rakesh Ghiya; Sanjay Ganapathy; Simon Baumgartner; Sofia Erell; Sushant Prakash; Thibault Sellam; Vikram Rao; Xuanhui Wang; Yaroslav Akulov; Yulong Yang; Zhen Yang; Zhixin Lai; Zhongru Wu; Anca Dragan; Avinatan Hassidim; Fernando Pereira; Slav Petrov; Srinivasan Venkatachary; Tulsee Doshi; Yossi Matias; Sasha Goldshtein; Dipanjan Das

arXiv:2512.10791·cs.CL·December 12, 2025

TL;DR

The FACTS Leaderboard introduces a comprehensive, multi-faceted benchmark suite to evaluate large language models' factual accuracy across diverse scenarios, including multimodal, parametric, search, and grounding tasks.

Contribution

The paper presents a new holistic benchmark suite, with automated scoring, for assessing factuality across multimodal, parametric, search, and grounding settings, with the aim of making factuality evaluation more robust.

Findings

01. Provides a balanced assessment of models' factuality
02. Includes four diverse sub-leaderboards for comprehensive evaluation
03. Employs automated judges for scalable scoring

Abstract

We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to…
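
The aggregation step mentioned in the abstract can be illustrated with a minimal sketch. The excerpt above does not state the aggregation rule, so this example assumes an unweighted mean of the four sub-leaderboard scores. The function name, the dictionary keys, and the example numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean

# Hypothetical keys for the four sub-leaderboards named in the abstract.
SUB_LEADERBOARDS = ("multimodal", "parametric", "search", "grounding_v2")


def aggregate_facts_score(scores: dict[str, float]) -> float:
    """Combine per-sub-leaderboard scores into one overall score.

    Assumption: an unweighted mean. The source only says the suite
    aggregates performance across the four sub-leaderboards.
    """
    missing = [name for name in SUB_LEADERBOARDS if name not in scores]
    if missing:
        raise ValueError(f"missing sub-leaderboard scores: {missing}")
    return mean(scores[name] for name in SUB_LEADERBOARDS)


# Made-up percentages for illustration only; not results from the paper.
example_scores = {
    "multimodal": 40.0,
    "parametric": 60.0,
    "search": 80.0,
    "grounding_v2": 60.0,
}
overall = aggregate_facts_score(example_scores)
print(f"overall score: {overall:.1f}")
```

Requiring all four scores before averaging keeps a model from receiving an overall score when one sub-leaderboard has not been evaluated; other choices, such as weighting by benchmark size, would be equally plausible under the wording of the abstract.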

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling · Multimodal Machine Learning Applications