Brittlebench: Quantifying LLM robustness via prompt sensitivity

Angelika Romanou; Mark Ibrahim; Candace Ross; Chantal Shaib; Kerem Oktar; Samuel J. Bell; Anaelia Ovalle; Jesse Dodge; Antoine Bosselut; Koustuv Sinha; Adina Williams

arXiv:2603.13285 · cs.LG · April 7, 2026

TL;DR

This paper introduces Brittlebench, a framework and evaluation pipeline to measure language model sensitivity to prompt variations, revealing significant performance degradation and ranking shifts caused by semantics-preserving perturbations.

Contribution

The work presents a theoretical framework and an evaluation pipeline for quantifying language model brittleness, defined as sensitivity to prompt variants.

Findings

01. Model performance can degrade by up to 12% due to prompt perturbations.

02. In 63% of cases, perturbations change the relative ranking of models.

03. Semantics-preserving input variations can account for up to half of performance variance.
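
To make the three kinds of quantity above concrete, here is a minimal sketch of how a degradation figure, a ranking change, and a prompt-related variance share could be computed from per-item scores. It is not the paper's pipeline: the toy `scores` table, the function names, and the choice of a within-item versus between-item variance split (via the law of total variance) are all assumptions made for illustration.

```python
from statistics import mean, pvariance

# Hypothetical toy data, not results from the paper.
# scores[model][i][v] is 1 if the model answers item i correctly under
# prompt variant v, else 0. Variant 0 is the original, unperturbed prompt.
scores = {
    "model_a": [[1, 1, 1], [1, 0, 0], [1, 0, 0], [0, 0, 0]],
    "model_b": [[1, 1, 1], [1, 1, 1], [0, 0, 0], [0, 0, 0]],
}

def accuracy_by_variant(items):
    """Accuracy over all items, computed separately for each prompt variant."""
    n_variants = len(items[0])
    return [mean(item[v] for item in items) for v in range(n_variants)]

def max_degradation(items):
    """Largest accuracy drop from the original prompt to any perturbed variant."""
    acc = accuracy_by_variant(items)
    return acc[0] - min(acc[1:])

def ranking(all_scores, variant):
    """Model names ordered from best to worst accuracy under one variant."""
    return sorted(
        all_scores,
        key=lambda m: accuracy_by_variant(all_scores[m])[variant],
        reverse=True,
    )

def prompt_variance_share(items):
    """Share of total score variance that lies within items, across variants.

    Assumed decomposition (law of total variance with equal group sizes):
    total variance = between-item variance + mean within-item variance.
    """
    total = pvariance([x for item in items for x in item])
    within = mean(pvariance(item) for item in items)
    return within / total if total else 0.0

for name, items in scores.items():
    print(name, max_degradation(items), round(prompt_variance_share(items), 3))
print(ranking(scores, 0), ranking(scores, 1))
```

In this toy table, the model that leads on the original prompt falls behind under the first perturbed variant, which is the kind of ranking shift the second finding counts. The within-item share is only one plausible reading of "prompt-related variance"; the paper's own definition may differ.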

Abstract

Existing evaluation methods largely rely on clean, static benchmarks, which can overestimate true model performance by failing to capture the noise and variability inherent in real-world user inputs. This is especially true for language models, which can face human-generated text queries containing mistakes, typos, or alternative ways of phrasing the same question. In this work, we introduce a theoretical framework for quantifying model sensitivity to prompt variants, or brittleness, that can enable us to disentangle data-induced difficulty from prompt-related variability. Using this framework, we design a novel evaluation pipeline, Brittlebench, to holistically evaluate the sensitivity of frontier models. We apply semantics-preserving perturbations to a suite of popular benchmarks, and observe model performance degrade by as much as 12%. However, these perturbations do not affect all…
