How Robustly do LLMs Understand Execution Semantics?
Claudio Spiess, Prem Devanbu, Earl T. Barr

TL;DR
This paper investigates how robustly large language models understand code semantics. It finds that the frontier model GPT-5.2 is brittle under input perturbations despite near-perfect accuracy on the original benchmark, while open-source reasoning models stay stable at lower accuracy, and it explores remedies to improve exception prediction.
Contribution
It provides a comparative analysis of open-source and frontier LLMs' robustness in code understanding and proposes methods to enhance exception prediction capabilities.
Findings
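Open-source reasoning models (DeepSeek-R1 family) maintain stable, though lower, accuracy (38% to 67%) under code transformations and input perturbations.
GPT-5.2 scores 99% on the original, unperturbed CRUXEval benchmark, but its accuracy declines by between 20% and 24% on perturbed inputs.
Many models perform much worse at predicting behavior on perturbed inputs that raise exceptions, and performance depends on the kind of exception.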
Abstract
LLMs demonstrate remarkable reasoning capabilities, yet whether they utilize internal world models or rely on sophisticated pattern matching remains open. We study LLMs through the lens of robustness of their code understanding using a standard program-output prediction task. Our results reveal a stark divergence in model behavior: while open-source reasoning models (DeepSeek-R1 family) maintain stable, albeit somewhat lower, accuracies (38% to 67%) under code transformations and input perturbations, the frontier model GPT-5.2 exhibits significant brittleness. Despite achieving a near-perfect score of 99% on the original, unperturbed CRUXEval benchmark, perturbed inputs trigger accuracy declines between 20% and 24%. In addition, we find that many models perform much worse at predicting behavior on perturbed inputs that raise exceptions, and that prediction performance depends on the kind…
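The program-output prediction task described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's harness: the toy function `f`, the helpers `observe` and `score`, the perturbed inputs, and the hard-coded `predictions` standing in for model answers are all hypothetical. The sketch runs a function on an original input and on perturbed inputs, records either the returned value or the exception type as ground truth, and reports accuracy separately for each kind of outcome.

```python
from collections import defaultdict


def f(text, sep):
    # Toy stand-in for a benchmark function (hypothetical, not from CRUXEval).
    head, tail = text.split(sep, 1)
    return tail + sep + head


def observe(func, args):
    """Run func on args; return ('value', repr(result)) or ('exception', type name)."""
    try:
        return ("value", repr(func(*args)))
    except Exception as exc:
        return ("exception", type(exc).__name__)


def score(func, cases, predictions):
    """Accuracy per ground-truth outcome: 'value' or the exception type name."""
    hits, totals = defaultdict(int), defaultdict(int)
    for args, predicted in zip(cases, predictions):
        kind, label = observe(func, args)
        bucket = "value" if kind == "value" else label
        totals[bucket] += 1
        hits[bucket] += int(predicted == (kind, label))
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


cases = [
    ("a-b", "-"),  # original input: returns normally
    ("ab", "-"),   # perturbed input: separator missing, unpacking fails
    ("a-b", 3),    # perturbed input: separator has the wrong type
]

# Hypothetical model answers, hard-coded for illustration only.
predictions = [
    ("value", "'b-a'"),
    ("value", "'ab'"),           # model missed the exception
    ("exception", "TypeError"),
]

print(score(f, cases, predictions))
```

Bucketing by exception type is one simple way to expose the effect the abstract reports, where prediction quality differs between normal returns and different kinds of exceptions.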
