VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts

Xin Liu; Lechen Zhang; Sheza Munir; Yiyang Gu; Lu Wang

arXiv:2505.09701 · cs.CL · September 30, 2025

TL;DR

VeriFact is a new framework that improves the evaluation of long-form responses from language models by better extracting and verifying facts, supported by a novel benchmark that measures both precision and recall.

Contribution

The paper introduces VeriFact, a framework for improved factuality evaluation, and FactRBench, a benchmark that assesses both precision and recall in long-form model responses.

Findings

01. VeriFact enhances fact completeness and preserves relational facts.

02. Larger models improve both precision and recall, but high precision doesn't always mean high recall.

03. FactRBench enables comprehensive evaluation of factuality in long-form responses.

Abstract

Large language models (LLMs) excel at generating long-form responses, but evaluating their factuality remains challenging due to complex inter-sentence dependencies within the generated facts. Prior solutions predominantly follow a decompose-decontextualize-verify pipeline but often fail to capture essential context and miss key relational facts. In this paper, we introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction by identifying and resolving incomplete and missing facts to support more accurate verification results. Moreover, we introduce FactRBench, a benchmark that evaluates both precision and recall in long-form model responses, whereas prior work primarily focuses on precision. FactRBench provides reference fact sets from advanced LLMs and human-written answers, enabling recall assessment. Empirical evaluations show that VeriFact…
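
To make the precision/recall distinction in the abstract concrete, here is a minimal Python sketch. It is not the paper's implementation: the function name `score_response`, the boolean-list inputs, and the example judgments are all hypothetical. It assumes that some upstream step has already (a) verified each fact extracted from a response and (b) checked whether the response covers each fact in a reference set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactualityScores:
    precision: float  # share of extracted facts judged supported
    recall: float     # share of reference facts the response covers


def score_response(extracted_supported: list[bool],
                   reference_covered: list[bool]) -> FactualityScores:
    """Compute precision and recall from per-fact judgments.

    Both inputs are hypothetical: the source does not specify this interface.
    extracted_supported[i] is True if the i-th fact extracted from the
    response was verified as supported.
    reference_covered[j] is True if the j-th reference fact appears in
    the response.
    """
    precision = (sum(extracted_supported) / len(extracted_supported)
                 if extracted_supported else 0.0)
    recall = (sum(reference_covered) / len(reference_covered)
              if reference_covered else 0.0)
    return FactualityScores(precision=precision, recall=recall)


# Made-up judgments: 3 of 4 extracted facts are supported,
# but only 2 of 5 reference facts are covered.
scores = score_response(
    extracted_supported=[True, True, True, False],
    reference_covered=[True, False, False, False, True],
)
print(scores)
```

In this made-up case the response is mostly accurate in what it says (precision 0.75) yet omits most of the reference facts (recall 0.4), which is the gap a precision-only evaluation would not show.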

Code & Models

Datasets

launch/FactRBench
dataset · 113 downloads

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques