What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis
Delip Rao, Chris Callison-Burch

TL;DR
This paper analyzes structured reasoning traces for 24K examples across 9 claim verification datasets to identify which reasoning skills they test, revealing biases and limitations in current benchmarks and proposing improvements.
Contribution
The paper systematically characterizes the reasoning types that verification datasets exercise and highlights dataset-level biases, informing the design of more comprehensive evaluation benchmarks.
Findings
Direct evidence extraction dominates current datasets.
Multi-sentence synthesis and numerical reasoning are underrepresented.
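Error profiles vary by domain: lexical overlap bias dominates general-domain verification, overcautiousness dominates scientific verification, and arithmetic reasoning failures dominate mathematical verification.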
Abstract
Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning these benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain -- general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings…
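The dataset-level breakdown described in the abstract can be pictured as a simple tally of reasoning-type labels per dataset. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `reasoning_breakdown`, the dataset names, and the label strings are all hypothetical, and the toy records are illustrative rather than taken from the paper.

```python
from collections import Counter, defaultdict


def reasoning_breakdown(records):
    """Return {dataset: {reasoning_type: share}} from (dataset, reasoning_type) pairs."""
    counts = defaultdict(Counter)
    for dataset, reasoning_type in records:
        counts[dataset][reasoning_type] += 1

    breakdown = {}
    for dataset, counter in counts.items():
        total = sum(counter.values())
        # Convert raw counts into the fraction of examples per reasoning type.
        breakdown[dataset] = {label: n / total for label, n in counter.items()}
    return breakdown


# Hypothetical toy records: one (dataset, reasoning_type) pair per annotated example.
records = [
    ("dataset_a", "direct_extraction"),
    ("dataset_a", "direct_extraction"),
    ("dataset_a", "direct_extraction"),
    ("dataset_a", "numerical_reasoning"),
    ("dataset_b", "direct_extraction"),
    ("dataset_b", "multi_sentence_synthesis"),
]

shares = reasoning_breakdown(records)
for dataset, dist in sorted(shares.items()):
    print(dataset, dist)
```

Comparing the per-dataset shares side by side is what would surface the kind of skew the abstract reports, such as one dataset leaning almost entirely on a single reasoning type while another splits more evenly.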
