Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
Yang Liu, Da Song, Armstrong Foundjem, Heng Li, Foutse Khomh

TL;DR
This paper investigates how Chain-of-Thought (CoT) prompting affects the robustness of large language models for code. It finds that the benefits of CoT depend on the model, the task, and prompt explicitness, and it identifies specific failure modes.
Contribution
The paper presents a large-scale empirical study of CoT robustness in code generation, defines structural anchors, and identifies failure modes and diagnostic signals.
Findings
CoT benefits vary by model, task, and prompt explicitness.
CoT and No-CoT have different robustness profiles and failure modes.
The study identifies three trajectory deformations: Lengthening, Branching, and Simplification.
Abstract
Chain-of-Thought (CoT) prompting is widely used to elicit explicit reasoning from large language models for code (LLM4Code). However, its impact on robustness and the stability of reasoning trajectories under realistic input perturbations remains poorly understood. Prior work has largely evaluated CoT through final correctness, leaving a critical gap in understanding how CoT reshapes internal uncertainty dynamics and why it sometimes harms rather than helps code generation. We suggest that CoT is not uniformly beneficial; instead, its robustness depends on whether perturbations destabilize structurally sensitive commitment points along the reasoning-to-code trajectory. We conduct a controlled, large-scale empirical study of CoT across six models and two code benchmarks (MHPP and BigCodeBench), subjecting task docstrings to systematic character-, word-, and sentence-level perturbations.…
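To make the three perturbation granularities concrete, here is a minimal Python sketch of what character-, word-, and sentence-level perturbations of a task docstring could look like. The specific operators (adjacent-letter swaps, random word drops, sentence shuffling), the function names, and the `rate` parameter are all illustrative assumptions; the abstract does not state which operators the study actually applies.

```python
import random


def perturb_chars(text: str, rate: float, rng: random.Random) -> str:
    """Character level (assumed operator): swap adjacent letters to mimic typos."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a letter moves at most once
        else:
            i += 1
    return "".join(chars)


def perturb_words(text: str, rate: float, rng: random.Random) -> str:
    """Word level (assumed operator): drop words at random, keeping at least one."""
    words = text.split()
    kept = [w for w in words if rng.random() >= rate]
    return " ".join(kept if kept else words[:1])


def perturb_sentences(text: str, rng: random.Random) -> str:
    """Sentence level (assumed operator): shuffle sentences, split naively on '. '."""
    sentences = [s.strip().rstrip(".") for s in text.split(". ") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."


if __name__ == "__main__":
    # Hypothetical docstring, not taken from MHPP or BigCodeBench.
    docstring = (
        "Return the sum of a list. Ignore negative numbers. "
        "Raise ValueError on empty input."
    )
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(perturb_chars(docstring, 0.1, rng))
    print(perturb_words(docstring, 0.2, rng))
    print(perturb_sentences(docstring, rng))
```

Passing a seeded `random.Random` instance keeps each perturbation reproducible, which matters when the same perturbed docstring has to be fed to both a CoT and a No-CoT prompt for comparison.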
