The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input

Alon Jacovi; Andrew Wang; Chris Alberti; Connie Tao; Jon Lipovetz; Kate Olszewska; Lukas Haas; Michelle Liu; Nate Keating; Adam Bloniarz; Carl Saroufim; Corey Fry; Dror Marcus; Doron Kukliansky; Gaurav Singh Tomar; James Swirhun; Jinwei Xing; Lily Wang; Madhu Gurumurthy; Michael Aaron; Moran Ambar; Rachana Fellinger; Rui Wang; Zizhao Zhang; Sasha Goldshtein; Dipanjan Das

arXiv:2501.03200 · cs.CL · January 7, 2025 · 2 cites

Open Access

TL;DR

The paper introduces FACTS Grounding, a benchmark and leaderboard for evaluating large language models' ability to generate factually accurate, long-form responses grounded in extensive context documents, using automated evaluation methods.

Contribution

It presents a new benchmark and leaderboard for assessing LLMs' factual grounding in long-form responses, with a comprehensive automated evaluation framework.

Findings

01. Automated judge models effectively evaluate factual grounding.
02. Benchmark supports long documents up to 32k tokens.
03. Active maintenance ensures ongoing evaluation and comparison.

Abstract

We introduce FACTS Grounding, an online leaderboard and associated benchmark that evaluates language models' ability to generate text that is factually accurate with respect to given context in the user prompt. In our benchmark, each prompt includes a user request and a full document, with a maximum length of 32k tokens, requiring long-form responses. The long-form responses are required to be fully grounded in the provided context document while fulfilling the user request. Models are evaluated using automated judge models in two phases: (1) responses are disqualified if they do not fulfill the user request; (2) they are judged as accurate if the response is fully grounded in the provided document. The automated judge models were comprehensively evaluated against a held-out test set to pick the best prompt template, and the final factuality score is an aggregate of multiple judge…
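
The two-phase evaluation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual implementation: the `Verdict` type, the function names, and the judge labels are hypothetical, and because the abstract is truncated before it explains how judge scores are combined, the simple mean across judges used here is an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Verdict:
    """One judge's verdict on one response (hypothetical structure)."""
    eligible: bool  # phase 1: does the response fulfill the user request?
    grounded: bool  # phase 2: is it fully grounded in the provided document?


def is_accurate(verdict: Verdict) -> bool:
    # A response disqualified in phase 1 never counts as accurate,
    # regardless of its phase 2 grounding verdict.
    return verdict.eligible and verdict.grounded


def judge_score(verdicts: list[Verdict]) -> float:
    # Fraction of responses this judge marks as accurate.
    return mean(1.0 if is_accurate(v) else 0.0 for v in verdicts)


def factuality_score(verdicts_by_judge: dict[str, list[Verdict]]) -> float:
    # ASSUMPTION: aggregate by averaging per-judge scores; the source text
    # does not state the aggregation rule.
    return mean(judge_score(v) for v in verdicts_by_judge.values())


if __name__ == "__main__":
    # Made-up verdicts for four responses from two hypothetical judges.
    verdicts = {
        "judge_a": [Verdict(True, True), Verdict(True, False),
                    Verdict(False, True), Verdict(True, True)],
        "judge_b": [Verdict(True, True), Verdict(True, True),
                    Verdict(False, False), Verdict(True, True)],
    }
    print(factuality_score(verdicts))
```

Note that the third response for `judge_a` is grounded but ineligible, so it contributes nothing to the score; this mirrors the abstract's rule that unhelpful responses are disqualified before grounding is considered.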

Taxonomy

Topics: Research Data Management Practices