Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation
Nazanin Jafari, James Allan, Mohit Iyyer

TL;DR
This paper introduces an evaluation framework for long-form LLM outputs that jointly measures factuality precision and recall, with recall weighted by the importance of each relevant fact.
Contribution
It proposes an importance-aware recall metric for factuality evaluation, addressing the gap left by existing precision-focused methods, which do not check whether a response covers the facts it should include.
Findings
Current LLMs perform better on factuality precision than recall.
Models cover highly important facts more reliably than less important ones.
Factual incompleteness is a major limitation in long-form generation.
Abstract
Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a response into atomic claims and verify each claim against external knowledge sources such as Wikipedia. However, this focus overlooks an equally important dimension of factuality: recall, that is, whether the generated response covers the relevant facts that should be included. We propose a comprehensive factuality evaluation framework that jointly measures precision and recall. Our method leverages external knowledge sources to construct reference facts and determine whether they are captured in generated text. We further introduce an importance-aware weighting scheme based on relevance and salience. Our analysis reveals that…
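The abstract does not give formulas, so the following is only an illustrative sketch of how precision and an importance-weighted recall could be computed once claims and reference facts have been judged. The names `ReferenceFact`, `precision`, and `weighted_recall`, the single `importance` weight, and the normalization by total weight are all assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReferenceFact:
    """A fact the response is expected to cover (hypothetical structure)."""
    text: str
    importance: float  # assumed non-negative weight, e.g. derived from relevance and salience
    covered: bool      # assumed to be judged upstream: does the response capture this fact?


def precision(claim_supported: List[bool]) -> float:
    """Fraction of the response's atomic claims that were verified as supported."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def weighted_recall(facts: List[ReferenceFact]) -> float:
    """Importance mass of covered reference facts divided by total importance mass."""
    total = sum(f.importance for f in facts)
    if total == 0:
        return 0.0
    covered = sum(f.importance for f in facts if f.covered)
    return covered / total


# Toy inputs, invented purely to exercise the functions.
claims = [True, True, True, False]
facts = [
    ReferenceFact("fact A", importance=1.0, covered=True),
    ReferenceFact("fact B", importance=0.5, covered=True),
    ReferenceFact("fact C", importance=0.5, covered=False),
]

print(precision(claims))       # 3 of 4 claims supported
print(weighted_recall(facts))  # 1.5 of 2.0 importance mass covered
```

With these toy inputs, plain recall would be 2 of 3 facts, while the weighted version credits the response more because the missed fact carries a low weight. How the paper actually combines relevance and salience into a weight is not stated in the abstract.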
