Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations
Shou-Tzu Han, Rodrigue Rizk, KC Santosh

TL;DR
This paper investigates why large language models are fragile to meaning-preserving perturbations, introduces diagnostic tools to analyze failure mechanisms, and proposes a taxonomy and repair strategies for these failures.
Contribution
The paper presents the Mechanistic Perturbation Diagnostics (MPD) framework and a failure taxonomy, advancing the understanding of LLM fragility and of potential repair methods.
Findings
Number-format paraphrasing is consistently more disruptive than name swaps.
The CAI metric predicts failures better than the divergence layer does.
Activation patching can recover some localized failures.
Abstract
Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%-45.1%), with number paraphrasing consistently more disruptive than name swaps. To trace the mechanistic basis of these failures, we introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI) into a unified diagnostic pipeline. CAI, a novel metric quantifying layer-wise divergence amplification, outperforms first…
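To make the answer-flip rate concrete, here is a minimal sketch of how such a rate could be tallied per perturbation type. It assumes a flip means the model's final answer on the perturbed variant differs from its answer on the original problem. The abstract does not spell out the exact definition, so this reading, the `normalize` helper, the function name, and the toy records are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def normalize(answer):
    """Canonicalize a final answer so that '1,200' and '1200' compare equal.

    Assumption: answers are numeric strings, as in GSM8K; anything that
    cannot be parsed as a number is compared as lowercase text.
    """
    text = str(answer).strip().replace(",", "")
    try:
        return float(text)
    except ValueError:
        return text.lower()


def answer_flip_rates(records):
    """Return the fraction of flipped answers for each perturbation type.

    records: iterable of (perturbation_type, original_answer, perturbed_answer),
    where both answers are the model's outputs (hypothetical data layout).
    """
    flips = defaultdict(int)
    totals = defaultdict(int)
    for kind, original, perturbed in records:
        totals[kind] += 1
        if normalize(original) != normalize(perturbed):
            flips[kind] += 1
    return {kind: flips[kind] / totals[kind] for kind in totals}


# Toy, made-up records for illustration only (not results from the paper).
toy_records = [
    ("name_swap", "18", "18"),
    ("name_swap", "1,200", "1200"),
    ("number_paraphrase", "18", "20"),
    ("number_paraphrase", "3.0", "3"),
]
rates = answer_flip_rates(toy_records)
```

On the toy records, neither name-swap answer changes after normalization, while one of the two number-paraphrase answers does. A stricter variant could count a flip only when a correct answer becomes incorrect; which definition the authors use cannot be determined from the truncated abstract.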
