FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality
Mingda Chen, Yang Li, Xilun Chen, Adina Williams, Gargi Ghosh, Scott Yih

TL;DR
FACTORY is a large, human-verified prompt set designed to rigorously evaluate the factual accuracy of long-form responses generated by language models, revealing significant challenges for current models.
Contribution
We introduce FACTORY, a novel, human-verified benchmark with challenging prompts that improve the assessment of long-form factuality in language models.
Findings
Approximately 40% of the claims in state-of-the-art model responses to FACTORY prompts are not factual, compared to only 10% on other datasets.
FACTORY is more challenging than existing datasets for factuality evaluation.
Models struggle to reason across long-tailed facts in the FACTORY benchmark.
Abstract
Long-form factuality evaluation assesses the ability of models to generate accurate, comprehensive responses to short prompts. Existing benchmarks often lack human verification, leading to potential quality issues. To address this limitation, we introduce FACTORY, a large-scale, human-verified prompt set. Developed using a model-in-the-loop approach and refined by humans, FACTORY includes challenging prompts that are fact-seeking, answerable, and unambiguous. We conduct human evaluations on 6 state-of-the-art language models using FACTORY and existing datasets. Our results show that FACTORY is a challenging benchmark: approximately 40% of the claims made in the responses of SOTA models are not factual, compared to only 10% for other datasets. Our analysis identifies the strengths of FACTORY over prior benchmarks, emphasizing its reliability and the necessity for models to reason across…
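The 40% figure in the abstract is a claim-level rate (non-factual claims over all claims), which is not the same as the share of responses containing at least one error. The sketch below illustrates the difference on toy data. It is a minimal illustration, not the authors' evaluation pipeline: the `Response` container, both function names, and the example verdicts are assumptions introduced here, and it presumes each response has already been split into claims with one human verdict per claim.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Response:
    # Hypothetical container: one human verdict (True = factual) per extracted claim.
    claim_is_factual: List[bool]


def claim_level_error_rate(responses: List[Response]) -> float:
    """Fraction of all claims, pooled across responses, judged not factual."""
    total = sum(len(r.claim_is_factual) for r in responses)
    if total == 0:
        raise ValueError("no claims to score")
    wrong = sum(not v for r in responses for v in r.claim_is_factual)
    return wrong / total


def response_level_error_rate(responses: List[Response]) -> float:
    """Fraction of responses containing at least one non-factual claim."""
    if not responses:
        raise ValueError("no responses to score")
    flawed = sum(any(not v for v in r.claim_is_factual) for r in responses)
    return flawed / len(responses)


# Toy verdicts, made up for illustration only (not data from the paper).
toy = [
    Response([True, True, False, False, True]),
    Response([True, False, True, True, True]),
]
claim_rate = claim_level_error_rate(toy)        # 3 of 10 claims are non-factual
response_rate = response_level_error_rate(toy)  # both responses contain an error
```

On this toy input the two rates diverge sharply, which is why a claim-level figure should not be read as a statement about the share of flawed responses.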
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Computational and Text Analysis Methods
