Examining LLMs Ability to Summarize Code Through Mutation-Analysis
Lara Khatib, Micheal Pu, Bogdan Vasilescu, Meiyappan Nagappan

TL;DR
This paper introduces a mutation-based evaluation method to assess whether LLM-generated code summaries reflect what the code actually does. Summary accuracy drops sharply as code complexity increases, exposing the limits of current models.
Contribution
It presents a novel mutation-analysis approach for systematically testing the fidelity of LLM code summaries against actual program behavior.
Findings
Summary accuracy drops from 76.5% to 17.3% with increased complexity
Models often describe intent rather than mutated behavior, with 49.3% accuracy
GPT-5.2 outperforms GPT-4, achieving 85.3% accuracy
Abstract
As developers increasingly rely on LLM-generated code summaries for documentation, testing, and review, it is important to study whether these summaries accurately reflect what the program actually does. LLMs often produce confident descriptions of what the code looks like it should do (intent), while missing subtle edge cases or logic changes that define what it actually does (behavior). We present a mutation-based evaluation methodology that directly tests whether a summary truly matches the code's logic. Our approach generates a summary, injects a targeted mutation into the code, and checks if the LLM updates its summary to reflect the new behavior. We validate it through three experiments totalling 624 mutation-summary evaluations across 62 programs. First, on 12 controlled synthetic programs with 324 mutations varying in type (statement, value, decision) and location (beginning,…
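The loop described in the abstract (summarize, inject a mutation, summarize again, compare) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `FlipComparison`, `mutate`, and `summary_reflects_mutation` are hypothetical, the `summarize` argument stands in for a call to an LLM, and the pass criterion used here (the two summaries differ) is a simplifying assumption, since the listing does not say how the paper judges a summary. The single mutation shown, changing `<` to `<=`, is one possible example of a decision mutation.

```python
import ast
from typing import Callable


class FlipComparison(ast.NodeTransformer):
    """Hypothetical decision mutation: turn one `<` comparison into `<=`."""

    def __init__(self) -> None:
        self.applied = False

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        self.generic_visit(node)  # visit nested expressions first
        if not self.applied and isinstance(node.ops[0], ast.Lt):
            node.ops[0] = ast.LtE()
            self.applied = True  # mutate a single site only
        return node


def mutate(source: str) -> str:
    """Return `source` with exactly one `<` replaced by `<=`."""
    mutator = FlipComparison()
    tree = mutator.visit(ast.parse(source))
    if not mutator.applied:
        raise ValueError("no mutation site found")
    return ast.unparse(ast.fix_missing_locations(tree))  # Python 3.9+


def summary_reflects_mutation(
    source: str, summarize: Callable[[str], str]
) -> bool:
    """Assumed criterion: the summary must change when the behavior changes."""
    before = summarize(source)
    after = summarize(mutate(source))
    return before.strip() != after.strip()


if __name__ == "__main__":
    program = "def is_minor(age):\n    return age < 18\n"

    # Stand-in for an LLM that describes intent only, ignoring the operator.
    def intent_only(code: str) -> str:
        return "Checks whether a person is a minor."

    print(mutate(program))
    print(summary_reflects_mutation(program, intent_only))
```

A summarizer that describes only what the function looks like it should do returns the same text for both versions and fails the check, which is the intent-versus-behavior gap the abstract describes. A real evaluation would need a stronger comparison than string inequality, for example a human or model judgment of whether the new summary states the changed boundary.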
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Scientific Computing and Data Management
