DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Bhuwan Dhingra, Markus Dreyer, Venkatesh Saligrama

TL;DR
DeepFact introduces an evolving benchmarking approach with auditable, revisable labels and rationales, improving factuality verification accuracy for deep research reports and yielding verifiers that transfer to external datasets.
Contribution
It proposes Audit-then-Score (AtS), an evolving benchmarking framework in which disputes over labels are adjudicated by an auditor, and instantiates it as DeepFact-Bench (the benchmark) and DeepFact-Eval (the verifier) for checking the factuality of deep research reports.
Findings
Expert accuracy on verifiable claims increased from 60.8% to 90.9% with AtS.
DeepFact-Eval outperforms existing verifiers on DeepFact-Bench.
DeepFact-Eval transfers well to external datasets.
Abstract
Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four…
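The audit-then-score loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Entry`, `Prediction`, `Auditor`, and `audit_then_score`, the boolean labels, and the toy auditor are all hypothetical, and the sketch assumes that an accepted dispute simply overwrites the benchmark entry with the challenger's label and evidence.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Entry:
    """Hypothetical benchmark entry: a revisable label plus its rationale."""
    label: bool
    rationale: str


@dataclass
class Prediction:
    """Hypothetical verifier output; evidence is needed to open a dispute."""
    label: bool
    evidence: Optional[str] = None


# Hypothetical auditor: returns True if the proposed revision is accepted.
Auditor = Callable[[str, Entry, Prediction], bool]


def audit_then_score(
    benchmark: Dict[str, Entry],
    predictions: Dict[str, Prediction],
    auditor: Auditor,
) -> float:
    """Resolve disputes first, then score against the updated benchmark."""
    # Audit phase: a disagreement backed by evidence goes to the auditor.
    for claim_id, pred in predictions.items():
        entry = benchmark[claim_id]
        if pred.label != entry.label and pred.evidence:
            if auditor(claim_id, entry, pred):
                # Assumption: an accepted revision replaces label and rationale.
                benchmark[claim_id] = Entry(pred.label, pred.evidence)

    # Score phase: accuracy against the (possibly revised) labels.
    if not predictions:
        return 0.0
    correct = sum(
        pred.label == benchmark[claim_id].label
        for claim_id, pred in predictions.items()
    )
    return correct / len(predictions)


# Toy usage: one undisputed claim, one disputed claim with evidence.
benchmark = {
    "c1": Entry(True, "initial expert label"),
    "c2": Entry(False, "initial expert label"),
}
predictions = {
    "c1": Prediction(True),
    "c2": Prediction(True, evidence="cited source supports the claim"),
}
accept_all: Auditor = lambda claim_id, entry, pred: True  # toy auditor
accuracy = audit_then_score(benchmark, predictions, accept_all)
```

In this toy run the auditor accepts the dispute on `c2`, so the benchmark label is revised before scoring. With an auditor that rejects the dispute, the original label would stand and the same prediction would count as an error.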
