CoCoNUT: Structural Code Understanding does not fall out of a tree
Claas Beger, Saikat Dutta

TL;DR
This paper evaluates the ability of state-of-the-art Large Language Models to understand and trace structural control flow in code, revealing significant limitations despite high performance on standard benchmarks.
Contribution
The authors introduce CoCoNUT, a dataset and benchmark specifically designed to assess code control flow understanding in LLMs, highlighting gaps in current models' reasoning capabilities.
Findings
Models perform poorly on control flow tracing, especially for complex structures.
Even top models like Gemini correctly generate only 47% of execution traces.
Specialized structures like OOP and recursion are poorly understood by current models.
Abstract
Large Language Models (LLMs) have shown impressive performance across a wide array of tasks involving both structured and unstructured textual data. Recent results on various benchmarks for code generation, repair, or completion suggest that certain models have programming abilities comparable to, or even surpassing, those of humans. In this work, we demonstrate that high performance on such benchmarks does not correlate with humans' innate ability to understand structural control flow in code. To this end, we extract solutions from the HumanEval benchmark, on which the relevant models perform strongly, and trace their execution path using function calls sampled from the respective test set. Using this dataset, we investigate the ability of seven state-of-the-art LLMs to match the execution trace and find that, despite their ability to generate semantically identical code, they possess limited ability…
Taxonomy
Topics: Software Engineering Research · Natural Language Processing Techniques · Software Testing and Debugging Techniques
