NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction

Peter Røysland Aarnes; Vinay Setty

arXiv:2511.09971·cs.CL·November 14, 2025

Open Access

TL;DR

This paper systematically evaluates the robustness of state-of-the-art language models on numerical veracity prediction using controlled perturbations, revealing accuracy drops of up to 62% and highlighting the need for better robustness strategies.

Contribution

The paper introduces a perturbation-based evaluation framework for numerical fact-checking models that exposes their vulnerabilities and points to perturbed in-context demonstrations as one way to improve robustness.

Findings

01. Model accuracy drops by up to 62% under certain perturbations.

02. Increasing context length generally reduces accuracy.

03. Enriching the extended context with perturbed demonstrations lets most models substantially recover.

Abstract

Large language models show strong performance on knowledge-intensive tasks such as fact-checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claim-evidence pairs, using controlled perturbations, including label-flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. No model proves robust across all conditions. We further find that increasing context length generally reduces accuracy, but when the extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.
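To make the idea of a controlled numerical perturbation concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the scaling rule, the example claim, and the reading of a "label-flipping probe" as a perturbation that changes the correct label are all assumptions made for illustration. The abstract does not specify the perturbation types or whether the reported drops are absolute or relative.

```python
import re

# Matches integers and simple decimals; a hypothetical, deliberately simple pattern.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def perturb_first_number(text: str, factor: float = 2.0) -> str:
    """Scale the first number found in `text` by `factor` (illustrative perturbation)."""
    def scale(match: re.Match) -> str:
        value = float(match.group()) * factor
        return f"{value:g}"  # drops a trailing ".0", so 10.0 is written as "10"
    return NUMBER.sub(scale, text, count=1)


def make_label_flip_probe(example: dict, factor: float = 2.0) -> dict:
    """Perturb the claim only, leaving the evidence unchanged.

    Assumption: a claim labelled "True" no longer matches its evidence once
    its number is changed, so the expected label becomes "False".
    """
    probe = dict(example)
    probe["claim"] = perturb_first_number(example["claim"], factor)
    if example["label"] == "True":
        probe["label"] = "False"
    return probe


def accuracy(predictions: list, labels: list) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)


# Hypothetical example pair, not taken from the paper's data.
original = {
    "claim": "Unemployment fell to 5 percent in 2020",
    "evidence": "The unemployment rate was 5 percent in 2020.",
    "label": "True",
}
probe = make_label_flip_probe(original)
print(probe["claim"], "->", probe["label"])
```

A robustness check in this style would compare `accuracy` on the original pairs against `accuracy` on the probes; a model that keeps answering "True" for the perturbed claim is keying on surface overlap with the evidence instead of the number itself.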

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education