Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?
Pedro Orvalho, Marta Kwiatkowska

TL;DR
This paper investigates whether large language models understand code semantics by applying mutations that preserve program meaning but alter syntax, revealing significant fragility and flawed reasoning in current models.
Contribution
It introduces a systematic evaluation of LLM robustness to semantics-preserving code mutations, highlighting their unstable reasoning and reliance on superficial cues.
Findings
Proprietary models have higher accuracy and reasoning quality.
In 10-50% of cases, LLMs reach correct predictions through flawed reasoning.
Performance drops by up to 70% under code mutations.
Abstract
With the widespread adoption of vibe coding, understanding the reasoning and robustness of Large Language Models (LLMs) is critical for their reliable use in programming tasks. While recent studies assess LLMs' ability to predict program outputs, most focus on accuracy alone, without evaluating the underlying reasoning. Moreover, it has been observed on mathematical reasoning tasks that LLMs can arrive at correct answers through flawed logic, raising concerns about similar issues in code understanding. In this paper we assess whether state-of-the-art LLMs can reason about Python programs or are simply guessing. We apply five semantics-preserving code mutations: renaming variables, mirroring comparison expressions, swapping if-else branches, converting for loops to while, and loop unrolling. These mutations maintain program semantics while altering its syntax. We evaluated nine LLMs,…
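To make two of the five mutations concrete, the sketch below applies variable renaming and comparison mirroring to a small Python function using the standard `ast` module, then checks that the original and mutated versions agree on sample inputs. It is an illustration only and not the authors' tooling: the function `count_below`, the renaming map, and the helper names `MirrorComparisons`, `RenameVariables`, `mutate`, and `load` are all hypothetical. Mirroring swaps the evaluation order of the two operands, so it preserves semantics only when the operands have no side effects, which holds for this example.

```python
import ast

# Each comparison operator mapped to its mirror: a < b is equivalent to b > a.
MIRROR = {
    ast.Lt: ast.Gt, ast.Gt: ast.Lt,
    ast.LtE: ast.GtE, ast.GtE: ast.LtE,
    ast.Eq: ast.Eq, ast.NotEq: ast.NotEq,
}

class MirrorComparisons(ast.NodeTransformer):
    """Rewrite `a OP b` as `b MIRROR(OP) a` for single-operator comparisons."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        if len(node.ops) == 1 and type(node.ops[0]) in MIRROR:
            mirrored = ast.Compare(
                left=node.comparators[0],
                ops=[MIRROR[type(node.ops[0])]()],
                comparators=[node.left],
            )
            return ast.copy_location(mirrored, node)
        return node  # chained comparisons and `in`/`is` are left untouched

class RenameVariables(ast.NodeTransformer):
    """Rename variables and parameters according to a fixed mapping."""
    def __init__(self, mapping):
        self.mapping = mapping
    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node
    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

# Hypothetical program under test.
SOURCE = '''
def count_below(values, limit):
    total = 0
    for v in values:
        if v < limit:
            total += 1
    return total
'''

def mutate(source):
    """Apply both mutations and return the mutated source text."""
    tree = ast.parse(source)
    tree = MirrorComparisons().visit(tree)
    renames = {"values": "xs", "limit": "k", "total": "acc", "v": "e"}
    tree = RenameVariables(renames).visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+

def load(source, name):
    """Execute source in a fresh namespace and return the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]

original = load(SOURCE, "count_below")
mutated_source = mutate(SOURCE)
mutated = load(mutated_source, "count_below")

# Same behaviour on sample inputs, different surface syntax.
for args in [([1, 5, 9, 3], 4), ([], 0), ([2, 2, 2], 2)]:
    assert original(*args) == mutated(*args)
print(mutated_source)
```

A robustness check in the spirit of the paper would show a model both `SOURCE` and `mutated_source` and compare its predicted outputs for the same inputs. A model that actually reasons about the program should give the same answer for both.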
