Robust Reasoning Benchmark
Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey

TL;DR
The paper introduces the Robust Reasoning Benchmark (RRB) to evaluate LLMs' resilience to textual perturbations. It reveals significant failure modes and argues that reliable reasoning requires explicit contextual resets.
Contribution
It presents a new benchmark of 13 deterministic textual perturbations, applied to AIME 2024 and AIME 2025, to test LLM robustness. It also identifies failure modes, such as attention dilution, that reduce reasoning accuracy.
Findings
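Frontier models are largely resilient to the perturbations.
Claude is the notable exception: it categorically refuses many transformed prompts.
Open-weights reasoning models show up to 54% average accuracy drops across perturbations, and up to 100% on some.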
Abstract
While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their problem-solving abilities depend on the context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), a pipeline of 13 deterministic textual perturbations applied to AIME 2024 and AIME 2025. Evaluating 8 state-of-the-art models, we find that frontier models are largely resilient, with the notable exception of Claude, which categorically refuses many transformed prompts. Open-weights reasoning models exhibit a range of failure modes under structural noise (cognitive thrashing, tokenization breakdown, and reasoning collapse), with up to 54% average accuracy drops across perturbations and up to 100% on some. We further study one of these failure modes in isolation: attention dilution caused by the model's own chain-of-thought. By tasking models with solving multiple…
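
To make the idea of a deterministic perturbation pipeline concrete, here is a minimal Python sketch. It is not the authors' implementation: the two perturbations (`reverse_words`, `space_out_characters`), the `PERTURBATIONS` registry, and the helper names are hypothetical stand-ins, since this summary does not list the 13 perturbations RRB uses. The drop metric assumes an absolute difference from baseline accuracy, which the abstract does not specify.

```python
from typing import Callable, Dict

# A perturbation is any deterministic text-to-text function.
Perturbation = Callable[[str], str]


def reverse_words(text: str) -> str:
    """Hypothetical perturbation: reverse the order of whitespace-separated words."""
    return " ".join(reversed(text.split()))


def space_out_characters(text: str) -> str:
    """Hypothetical perturbation: put a space between every character."""
    return " ".join(text)


# Hypothetical registry; the real benchmark defines 13 perturbations.
PERTURBATIONS: Dict[str, Perturbation] = {
    "reverse_words": reverse_words,
    "space_out_characters": space_out_characters,
}


def perturb_all(problem: str) -> Dict[str, str]:
    """Apply every registered perturbation to a single problem statement."""
    return {name: fn(problem) for name, fn in PERTURBATIONS.items()}


def average_accuracy_drop(baseline: float, perturbed: Dict[str, float]) -> float:
    """Mean absolute drop from baseline accuracy across perturbations.

    Assumption: the drop is baseline minus perturbed accuracy; a relative
    drop would divide each difference by the baseline instead.
    """
    drops = [baseline - acc for acc in perturbed.values()]
    return sum(drops) / len(drops)
```

Because each perturbation is a pure function of its input, the same problem always yields the same perturbed prompt, so accuracy differences can be attributed to the perturbation rather than to sampling of the transformed text.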
