VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation

Yixiao Song; Yekyung Kim; Mohit Iyyer

arXiv:2406.19276·cs.CL·June 28, 2024

TL;DR

VERISCORE is a new metric designed to evaluate the factual accuracy of long-form text generation, effectively handling both verifiable and unverifiable content across diverse tasks.

Contribution

It introduces VERISCORE, a versatile factuality metric that works with different language models and addresses limitations of existing claim-based evaluation methods.

Findings

01. VERISCORE's extracted claims are more sensible than those from competing methods.

02. Open-weight models are closing the performance gap with GPT-4o.

03. Factuality scores vary significantly across different tasks.

Abstract

Existing metrics for evaluating the factuality of long-form text, such as FACTSCORE (Min et al., 2023) and SAFE (Wei et al., 2024), decompose an input text into "atomic claims" and verify each against a knowledge base like Wikipedia. These metrics are not suitable for most generation tasks because they assume that every claim is verifiable (i.e., can plausibly be proven true or false). We address this issue with VERISCORE, a metric for diverse long-form generation tasks that contain both verifiable and unverifiable content. VERISCORE can be effectively implemented with either closed or fine-tuned open-weight language models, and human evaluation confirms that VERISCORE's extracted claims are more sensible than those from competing methods across eight different long-form tasks. We use VERISCORE to evaluate generations from 16 different models across multiple long-form tasks and find…
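The decompose-then-verify pattern described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation or its scoring formula, which the truncated abstract does not give: the `Claim` type, the `supported_fraction` function, and the toy `KNOWLEDGE` set are all hypothetical names introduced here. The sketch assumes claims have already been extracted and labeled as verifiable or not, and it simply skips unverifiable claims before checking the rest against a knowledge source.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Claim:
    """A hypothetical extracted claim; `verifiable` marks whether it can be checked."""
    text: str
    verifiable: bool


def supported_fraction(
    claims: Iterable[Claim],
    is_supported: Callable[[str], bool],
) -> Optional[float]:
    """Return the share of verifiable claims that the checker supports.

    Unverifiable claims (opinions, hypotheticals, etc.) are ignored rather
    than counted as errors. Returns None when nothing is verifiable.
    """
    verifiable = [c for c in claims if c.verifiable]
    if not verifiable:
        return None
    supported = sum(1 for c in verifiable if is_supported(c.text))
    return supported / len(verifiable)


# Toy knowledge source standing in for retrieval against Wikipedia or the web.
KNOWLEDGE = {"Paris is the capital of France."}

claims = [
    Claim("Paris is the capital of France.", verifiable=True),
    Claim("Berlin is the capital of Spain.", verifiable=True),
    Claim("I think Paris is lovely.", verifiable=False),  # opinion: skipped
]

score = supported_fraction(claims, lambda text: text in KNOWLEDGE)
print(score)
```

In this toy run, one of the two verifiable claims is supported, and the opinion does not affect the result. A metric that treated every claim as verifiable would instead have to count the opinion as either supported or unsupported.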

Code & Models

Repositories

Yixiao-Song/VeriScore (official)

Videos

VeriScore: Evaluating the factuality of verifiable claims in long-form text generation · Underline

Taxonomy

Topics: Topic Modeling · Software Engineering Research · Natural Language Processing Techniques

Methods: Balanced Selection