NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction
Peter Røysland Aarnes, Vinay Setty

TL;DR
This paper systematically evaluates the robustness of state-of-the-art language models in numerical veracity prediction using controlled perturbations, revealing significant accuracy drops and highlighting the need for improved robustness strategies.
Contribution
It introduces a comprehensive perturbation-based evaluation framework for numerical fact-checking models, uncovering their vulnerabilities and potential ways to improve robustness.
Findings
Model accuracy drops by up to 62% under certain perturbations
Increasing context length often reduces accuracy
Enriching the extended context with perturbed demonstrations lets most models substantially recover
Abstract
Large language models show strong performance on knowledge-intensive tasks such as fact-checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on pairs of numerical claims and evidence, using controlled perturbations, including label-flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. No model proves robust across all conditions. We further find that increasing context length generally reduces accuracy, but when the extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.
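To make the idea of a controlled numerical perturbation concrete, here is a minimal sketch in Python. It is not the paper's implementation: the scaling perturbation, the function names (`perturb_numbers`, `accuracy`, `accuracy_drop`), and the example claim are all assumptions chosen for illustration. It rewrites every number in a claim by a fixed factor and measures how much accuracy changes between clean and perturbed inputs.

```python
import re


def perturb_numbers(text, factor=10):
    """Scale every integer or decimal in `text` by `factor`.

    Hypothetical perturbation for illustration only; the paper's own
    perturbation types are not specified in this summary.
    """
    def scale(match):
        value = float(match.group()) * factor
        # Render whole numbers without a trailing ".0".
        return str(int(value)) if value.is_integer() else str(value)

    return re.sub(r"\d+(?:\.\d+)?", scale, text)


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def accuracy_drop(clean_preds, clean_labels, perturbed_preds, perturbed_labels):
    """Accuracy on clean inputs minus accuracy on perturbed inputs."""
    return accuracy(clean_preds, clean_labels) - accuracy(perturbed_preds, perturbed_labels)


# A label-flipping probe, under the assumption that the evidence stays fixed:
# changing the number in a supported claim should turn its label to "False".
claim = "The city has 3.5 million residents"
perturbed_claim = perturb_numbers(claim, factor=10)
```

In this sketch, a model that still answers "True" for `perturbed_claim` against unchanged evidence would count as an error on the perturbed set, which is what drives the accuracy drop.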
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
