Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models

Felix Mächtle; Jan-Niclas Serr; Nils Loose; Thomas Eisenbarth

arXiv:2601.12951·cs.SE·January 21, 2026

TL;DR

This paper introduces a diagnostic framework to evaluate large language models' code comprehension, revealing that their performance aligns poorly with traditional software complexity metrics and instead reflects model-specific patterns.

Contribution

The paper proposes a novel binary input-output consistency framework for assessing code understanding and shows that LLM success is only weakly associated with human-centric complexity measures.
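To make the binary input-output consistency framing concrete, here is a minimal sketch of what one such task instance could look like: a program, an input, a candidate output, and a yes/no label saying whether the program really maps that input to that output. The names `ConsistencySample`, `make_samples`, and `accuracy` are hypothetical, and how the paper actually constructs inconsistent pairs is not described on this page, so the supplied wrong output below is purely an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ConsistencySample:
    """One binary task instance (hypothetical structure, not the paper's schema)."""
    source: str       # program text shown to the model
    input_repr: str   # textual form of the input
    output_repr: str  # textual form of the candidate output
    label: bool       # True if running the program on the input yields the output


def make_samples(source: str, func_name: str, arg, wrong_output) -> List[ConsistencySample]:
    """Build one consistent and one inconsistent sample for a single input."""
    namespace: dict = {}
    exec(source, namespace)  # only for trusted toy programs
    true_output = namespace[func_name](arg)
    if wrong_output == true_output:
        raise ValueError("wrong_output must differ from the true output")
    return [
        ConsistencySample(source, repr(arg), repr(true_output), True),
        ConsistencySample(source, repr(arg), repr(wrong_output), False),
    ]


def accuracy(predict: Callable[[ConsistencySample], bool],
             samples: List[ConsistencySample]) -> float:
    """Fraction of samples on which a yes/no predictor matches the label."""
    return sum(predict(s) == s.label for s in samples) / len(samples)


samples = make_samples("def f(x):\n    return x * 2 + 1\n", "f", 3, wrong_output=8)
```

Because every instance reduces to a yes/no decision, the same samples can in principle be scored for a classifier (which emits a label) and for a generative model (whose answer is parsed into a label), which is the property the framing is meant to provide.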

Findings

01. Human-centric complexity metrics predict LLM success only weakly (AUROC 0.63)

02. Shadow models predict LLM success more accurately (AUROC 0.86)

03. LLM comprehension reflects complex, non-human regularities

Abstract

Large Language Models (LLMs) are increasingly integrated into software engineering workflows, yet current benchmarks provide only coarse performance summaries that obscure the diverse capabilities and limitations of these models. This paper investigates whether LLMs' code-comprehension performance aligns with traditional human-centric software metrics or instead reflects distinct, non-human regularities. We introduce a diagnostic framework that reframes code understanding as a binary input-output consistency task, enabling the evaluation of classification and generative models. Using a large-scale dataset, we correlate model performance with traditional, human-centric complexity metrics, such as lexical size, control-flow complexity, and abstract syntax tree structure. Our analyses reveal minimal correlation between human-defined metrics and LLM success (AUROC 0.63), while shadow models…
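For readers unfamiliar with the AUROC figures quoted above: AUROC is the probability that a randomly chosen success receives a higher predictor score than a randomly chosen failure, so 0.5 corresponds to chance and 1.0 to perfect ranking. The sketch below computes it directly from that definition. The function name `auroc` and the toy scores are illustrative assumptions, not the paper's evaluation code or data.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC via pairwise comparison of positive and negative scores.

    scores: predictor output per sample (higher means "LLM more likely to succeed")
    labels: 1 if the LLM actually succeeded on that sample, else 0
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # success ranked above failure
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy, made-up scores: a predictor that ranks successes no better than chance.
chance_like = auroc([0.2, 0.6, 0.4, 0.8], [0, 1, 1, 0])
# Toy, made-up scores: a predictor that ranks every success above every failure.
perfect = auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

The quadratic pairwise loop is chosen for readability; for large datasets a rank-based computation gives the same value more efficiently.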

Code & Models

Datasets

Felix6326727/beyond-accuracy-code-comprehension
dataset · 22 dl

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Software System Performance and Reliability