DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

Yukun Huang; Leonardo F. R. Ribeiro; Momchil Hardalov; Bhuwan Dhingra; Markus Dreyer; Venkatesh Saligrama

arXiv:2603.05912·cs.AI·April 7, 2026

TL;DR

DeepFact introduces an evolving benchmark with auditable, revisable labels and rationales for claim-level factuality in deep research reports. With it, expert accuracy on verifiable claims rose from 60.8% to 90.9%, and the accompanying verifier transfers well to external datasets.

Contribution

The paper proposes Audit-then-Score (AtS), an evolving benchmarking framework in which disputed labels are adjudicated and revised before scoring. It instantiates the framework as DeepFact-Bench, the benchmark, and DeepFact-Eval, a verifier for research-report factuality.

Findings

01 Expert accuracy on verifiable claims increased from 60.8% to 90.9% with AtS.

02 DeepFact-Eval outperforms existing verifiers on DeepFact-Bench.

03 DeepFact-Eval transfers well to external datasets.

Abstract

Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four…
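
The audit-then-score loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `Entry`, `Dispute`, `audit_then_score`, and `accept_if_evidence`, the label strings, and the toy auditor rule are all hypothetical, and the real auditor is presumably far more careful than a single callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Entry:
    """One benchmark item: a revisable label plus its rationale (hypothetical schema)."""
    label: str
    rationale: str


@dataclass
class Dispute:
    """A verifier's challenge to a benchmark label, backed by evidence."""
    claim_id: str
    proposed_label: str
    evidence: str


def audit_then_score(
    benchmark: Dict[str, Entry],
    predictions: Dict[str, str],
    evidence: Dict[str, str],
    auditor: Callable[[Entry, Dispute], bool],
) -> float:
    """Audit disagreements first, then score against the revised benchmark."""
    # Audit phase: every disagreement must come with evidence to be considered.
    for claim_id, predicted in predictions.items():
        entry = benchmark[claim_id]
        if predicted == entry.label:
            continue
        submitted = evidence.get(claim_id, "")
        if not submitted:
            continue  # no evidence submitted, so no dispute can be filed
        dispute = Dispute(claim_id, predicted, submitted)
        if auditor(entry, dispute):
            # Accepted revision: update label and rationale before scoring.
            benchmark[claim_id] = Entry(predicted, submitted)

    # Score phase: accuracy against the (possibly revised) labels.
    if not predictions:
        return 0.0
    correct = sum(predictions[cid] == benchmark[cid].label for cid in predictions)
    return correct / len(predictions)


def accept_if_evidence(entry: Entry, dispute: Dispute) -> bool:
    """Toy auditor (an assumption): accept any dispute that cites a source."""
    return "source:" in dispute.evidence


benchmark = {
    "c1": Entry("supported", "matches cited paper"),
    "c2": Entry("supported", "expert judgment"),
    "c3": Entry("unsupported", "no source found"),
}
predictions = {"c1": "supported", "c2": "unsupported", "c3": "supported"}
evidence = {"c2": "source: erratum contradicts the claim"}

score = audit_then_score(benchmark, predictions, evidence, accept_if_evidence)
```

In this toy run the dispute on `c2` is accepted and revises the benchmark, while the unevidenced disagreement on `c3` is simply scored as an error. The point of the ordering is that a verifier is never penalized for a benchmark label that it successfully overturns.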

Code & Models

Datasets

kkkevinkkk/DeepFactBench (dataset, 51 downloads)
