CoCoNUT: Structural Code Understanding does not fall out of a tree

Claas Beger; Saikat Dutta

arXiv:2501.16456 · cs.LG · March 5, 2025

TL;DR

This paper evaluates the ability of state-of-the-art Large Language Models to understand and trace structural control flow in code, revealing significant limitations despite high performance on standard benchmarks.

Contribution

The authors introduce CoCoNUT, a dataset and benchmark specifically designed to assess code control flow understanding in LLMs, highlighting gaps in current models' reasoning capabilities.

Findings

01

Models perform poorly on control flow tracing, especially for complex structures.

02

Even the top-performing model, Gemini, fully and correctly generates only 47% of HumanEval task traces.

03

Specialized structures like OOP and recursion are poorly understood by current models.

Abstract

Large Language Models (LLMs) have shown impressive performance across a wide array of tasks involving both structured and unstructured textual data. Recent results on various benchmarks for code generation, repair, or completion suggest that certain models have programming abilities comparable to, or even surpassing, those of humans. In this work, we demonstrate that high performance on such benchmarks does not correlate with humans' innate ability to understand structural control flow in code. To this end, we extract solutions from the HumanEval benchmark, on which the relevant models perform strongly, and trace their execution path using function calls sampled from the respective test set. Using this dataset, we investigate the ability of seven state-of-the-art LLMs to match the execution trace and find that, despite their ability to generate semantically identical code, they possess limited ability…

Code & Models

Repositories

ClaasBeger/StructuralCodeUnderstanding
Official

Datasets

ClaasBeger/CoCoNUT
dataset · 56 downloads

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Software Testing and Debugging Techniques