When Verification Fails: How Compositionally Infeasible Claims Escape Rejection
Muxin Liu, Delip Rao, Grace Kim, Chris Callison-Burch

TL;DR
Current scientific claim verification models often rely on shortcut reasoning and over-accept compositionally infeasible claims, a failure that existing verification benchmarks cannot detect.
Contribution
The paper shows that existing benchmarks cannot distinguish rigorous verification from shortcut strategies, and it introduces compositionally infeasible claims to better evaluate model reasoning.
Findings
Models over-accept compositionally infeasible claims, indicating reliance on shortcut reasoning.
Existing benchmarks cannot differentiate between true verification and salient-constraint checking.
Verification thresholds vary across models, reflecting structural biases rather than reasoning ability.
Abstract
Scientific claim verification, the task of determining whether claims are entailed by scientific evidence, is fundamental to establishing discoveries in evidence while preventing misinformation. This process involves evaluating each asserted constraint against validated evidence. Under the Closed-World Assumption (CWA), a claim is accepted if and only if all asserted constraints are positively supported. We show that existing verification benchmarks cannot distinguish models enforcing this standard from models applying a simpler shortcut called salient-constraint checking, which applies CWA's rejection criterion only to the most salient constraint and accepts when that constraint is supported. Because existing benchmarks construct infeasible claims by perturbing a single salient element, they are insufficient for distinguishing between rigorous claim verification and simple…
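The contrast between the two acceptance rules described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the `Constraint` class, the numeric `salience` score, the set-membership test for "positively supported", and the toy claim are all hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    text: str
    salience: float  # hypothetical score for how prominent the constraint is


def cwa_verify(constraints, supported):
    """CWA rule: accept only if every asserted constraint is supported."""
    return all(c.text in supported for c in constraints)


def salient_constraint_check(constraints, supported):
    """Shortcut rule: test only the most salient constraint."""
    most_salient = max(constraints, key=lambda c: c.salience)
    return most_salient.text in supported


# Toy claim (hypothetical) asserting two constraints.
claim = [
    Constraint("drug X lowers blood pressure", salience=0.9),
    Constraint("effect holds in pediatric patients", salience=0.2),
]

# Case 1: the salient constraint is unsupported. Both rules reject, so a
# benchmark built this way cannot tell the two rules apart.
salient_perturbed = {"effect holds in pediatric patients"}
print(cwa_verify(claim, salient_perturbed))                # False
print(salient_constraint_check(claim, salient_perturbed))  # False

# Case 2: only the non-salient constraint is unsupported. The CWA rule
# rejects, while the shortcut accepts.
nonsalient_unsupported = {"drug X lowers blood pressure"}
print(cwa_verify(claim, nonsalient_unsupported))                # False
print(salient_constraint_check(claim, nonsalient_unsupported))  # True
```

In this sketch the two rules disagree only when a less salient constraint lacks support, which is the kind of case a single-salient-perturbation benchmark would not contain.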
