Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models
Felix Mächtle, Jan-Niclas Serr, Nils Loose, Thomas Eisenbarth

TL;DR
This paper introduces a diagnostic framework to evaluate large language models' code comprehension, revealing that their performance aligns poorly with traditional software complexity metrics and instead reflects model-specific patterns.
Contribution
It proposes a novel binary input-output consistency framework for assessing code understanding and demonstrates that LLM success is only weakly correlated with human-centric complexity measures.
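To make the binary input-output consistency framing concrete, here is a minimal sketch in Python. It assumes each sample pairs a program with an input and a claimed output, and that the model answers "consistent" or "not consistent". The names `ConsistencySample`, `ground_truth_label`, and `score_model` are hypothetical illustrations, not the authors' implementation, and the summary does not describe how the actual dataset is built.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ConsistencySample:
    """One hypothetical sample: does `claimed_output` match what the program returns?"""
    source: str             # program text shown to the model
    entry: str              # name of the function to call
    arg: object             # input passed to that function
    claimed_output: object  # output the model must accept or reject


def ground_truth_label(sample: ConsistencySample) -> bool:
    """Run the program to obtain the reference label (True = consistent).

    Assumption: the program is trusted and deterministic. Real pipelines
    would sandbox execution instead of calling exec directly.
    """
    namespace: dict = {}
    exec(sample.source, namespace)
    actual = namespace[sample.entry](sample.arg)
    return actual == sample.claimed_output


def score_model(
    predict: Callable[[ConsistencySample], bool],
    samples: Sequence[ConsistencySample],
) -> float:
    """Fraction of samples on which the model's yes/no answer matches the label."""
    correct = sum(predict(s) == ground_truth_label(s) for s in samples)
    return correct / len(samples)


DOUBLE = "def f(x):\n    return x * 2\n"
samples = [
    ConsistencySample(DOUBLE, "f", 3, 6),  # consistent pair
    ConsistencySample(DOUBLE, "f", 3, 7),  # inconsistent pair
]
```

Because every sample reduces to a yes/no decision, the same scoring loop can wrap a classifier's label or a generative model's parsed answer, which is what lets a single task cover both model types.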
Findings
Human-centric complexity metrics only weakly predict LLM success (AUROC 0.63)
Shadow models predict LLM success considerably better (AUROC 0.86)
LLM comprehension captures complex, non-human regularities
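For reading the two AUROC figures above: an AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, so 0.63 is only modestly better than chance while 0.86 is a substantially stronger signal. The sketch below computes AUROC directly from its pairwise definition. The function name and the toy scores are illustrative assumptions and have no connection to the paper's data.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a random positive outranks a random negative.

    `scores` are predictor outputs (e.g. a complexity metric or a shadow
    model's confidence); `labels` are 1 where the LLM succeeded, else 0.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ranked correctly.
toy_auroc = auroc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])  # 0.75
```

The quadratic pairwise loop is chosen for clarity. For large datasets a rank-based formulation or a library routine would be the usual choice.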
Abstract
Large Language Models (LLMs) are increasingly integrated into software engineering workflows, yet current benchmarks provide only coarse performance summaries that obscure the diverse capabilities and limitations of these models. This paper investigates whether LLMs' code-comprehension performance aligns with traditional human-centric software metrics or instead reflects distinct, non-human regularities. We introduce a diagnostic framework that reframes code understanding as a binary input-output consistency task, enabling the evaluation of classification and generative models. Using a large-scale dataset, we correlate model performance with traditional, human-centric complexity metrics, such as lexical size, control-flow complexity, and abstract syntax tree structure. Our analyses reveal minimal correlation between human-defined metrics and LLM success (AUROC 0.63), while shadow models…
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Software System Performance and Reliability
