FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality

Mingda Chen; Yang Li; Xilun Chen; Adina Williams; Gargi Ghosh; Scott Yih

arXiv:2508.00109·cs.CL·August 4, 2025

TL;DR

FACTORY is a large, human-verified prompt set designed to rigorously evaluate the factual accuracy of long-form responses generated by language models, revealing significant challenges for current models.

Contribution

We introduce FACTORY, a novel, human-verified benchmark with challenging prompts that improve the assessment of long-form factuality in language models.

Findings

01

Approximately 40% of the claims in SOTA model responses on FACTORY are not factual, compared to only 10% on other datasets.

02

FACTORY is more challenging than existing datasets for factuality evaluation.

03

Models struggle to reason across long-tailed facts in the FACTORY benchmark.

Abstract

Long-form factuality evaluation assesses the ability of models to generate accurate, comprehensive responses to short prompts. Existing benchmarks often lack human verification, leading to potential quality issues. To address this limitation, we introduce FACTORY, a large-scale, human-verified prompt set. Developed using a model-in-the-loop approach and refined by humans, FACTORY includes challenging prompts that are fact-seeking, answerable, and unambiguous. We conduct human evaluations on 6 state-of-the-art language models using FACTORY and existing datasets. Our results show that FACTORY is a challenging benchmark: approximately 40% of the claims made in the responses of SOTA models are not factual, compared to only 10% for other datasets. Our analysis identifies the strengths of FACTORY over prior benchmarks, emphasizing its reliability and the necessity for models to reason across…
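The 40% and 10% figures in the abstract are claim-level rates: each response is broken into individual claims, each claim is judged factual or not, and the rate is taken over all claims. The following is a minimal sketch of that arithmetic, not the authors' evaluation pipeline. The function name `nonfactual_claim_rate`, the boolean label format, and the toy labels are all illustrative assumptions, and pooling claims across responses (rather than averaging per response) is likewise an assumption.

```python
from typing import List


def nonfactual_claim_rate(claim_labels: List[List[bool]]) -> float:
    """Return the fraction of claims judged not factual, pooled over responses.

    claim_labels[i][j] is True if claim j of response i was judged factual.
    This label format is a hypothetical one chosen for illustration.
    """
    total = sum(len(response) for response in claim_labels)
    if total == 0:
        raise ValueError("no claims to score")
    nonfactual = sum(
        1 for response in claim_labels for is_factual in response if not is_factual
    )
    return nonfactual / total


# Toy labels (made up, not from the paper): 3 responses, 10 claims, 4 not factual.
toy_labels = [
    [True, True, False],
    [True, False],
    [True, True, True, False, False],
]
rate = nonfactual_claim_rate(toy_labels)
print(f"{rate:.0%} of claims are not factual")
```

Note that a claim-level rate differs from a response-level rate: in the toy data every response contains at least one non-factual claim, yet only 4 of the 10 claims are non-factual.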

Code & Models

Datasets

facebook/FACTORY
dataset · 144 dl

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Computational and Text Analysis Methods