TL;DR
This paper empirically evaluates how large language models handle five types of structured perturbations injected into chain-of-thought reasoning, revealing that vulnerability differs by perturbation type and that model scale helps with some types but not others.
Contribution
It provides a comprehensive analysis of LLM robustness to five types of reasoning perturbations across 13 models spanning three orders of magnitude in parameter count, highlighting where scaling improves robustness and where challenges persist.
Findings
MathError perturbations cause the most severe accuracy loss in small models (50-60%), but robustness to them improves strongly with scale.
UnitConversion remains difficult across all model sizes.
ExtraSteps perturbations minimally affect accuracy even in small models.
Abstract
Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count, testing their ability to complete mathematical reasoning tasks despite perturbations injected in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all…
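The evaluation setup described in the abstract (corrupt one step of a reasoning chain, then compare accuracy with and without the corruption) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's code: the helper names `inject_math_error` and `accuracy_loss` are hypothetical, the reasoning chain and graded outcomes are toy placeholders rather than reported results, and "accuracy loss" is taken to mean an absolute drop in percentage points, which the abstract does not specify.

```python
from typing import List


def inject_math_error(steps: List[str], index: int, wrong_step: str) -> List[str]:
    """Return a copy of the reasoning chain with one step replaced by an incorrect one.

    Hypothetical helper: a stand-in for a MathError-style perturbation.
    """
    corrupted = list(steps)  # copy so the clean chain is left untouched
    corrupted[index] = wrong_step
    return corrupted


def accuracy(graded: List[bool]) -> float:
    """Fraction of answers graded as correct."""
    if not graded:
        raise ValueError("need at least one graded answer")
    return sum(graded) / len(graded)


def accuracy_loss(clean: List[bool], perturbed: List[bool]) -> float:
    """Absolute accuracy drop in percentage points (assumed definition)."""
    return 100.0 * (accuracy(clean) - accuracy(perturbed))


# Toy reasoning chain (placeholder, not taken from the paper).
clean_chain = [
    "There are 3 boxes with 4 apples each.",
    "3 * 4 = 12 apples in total.",
    "After eating 2, 12 - 2 = 10 apples remain.",
]
perturbed_chain = inject_math_error(clean_chain, 1, "3 * 4 = 14 apples in total.")

# Toy graded outcomes for one model on four problems (placeholders).
clean_results = [True, True, True, True]
perturbed_results = [True, False, True, False]

loss = accuracy_loss(clean_results, perturbed_results)
```

In a real harness the graded outcomes would come from asking a model to finish each problem from the clean and the perturbed chain, then repeating the comparison per perturbation type and per model size.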
