Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

Changshu Liu; Alireza Ghazanfari; Yang Chen; Reyhaneh Jabbarvand

arXiv:2512.14917·cs.SE·April 27, 2026

TL;DR

This paper introduces a dataset of 1200 code reasoning problems drawn from existing benchmarks and popular GitHub Python repositories, and categorizes them by complexity to evaluate large language models' reasoning abilities under more realistic conditions.

Contribution

The paper presents a dataset that reflects real-world code complexity. Its pipeline uses static and dynamic program analysis to serialize and deserialize complex types, and the problems are categorized by difficulty, which makes the evaluation more realistic.
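To illustrate what serializing non-primitive types involves, here is a minimal Python sketch of a round trip through JSON for nested custom objects. It is not the paper's pipeline, which the summary says relies on static and dynamic program analysis. The names `REGISTRY`, `register`, `Point`, `Segment`, `to_jsonable`, and `from_jsonable` are hypothetical, and the sketch assumes every custom type is a registered dataclass.

```python
import json
from dataclasses import dataclass, fields, is_dataclass

# Hypothetical registry of custom types the deserializer is allowed to rebuild.
REGISTRY = {}


def register(cls):
    """Record a dataclass under its class name so it can be reconstructed."""
    REGISTRY[cls.__name__] = cls
    return cls


@register
@dataclass
class Point:
    x: int
    y: int


@register
@dataclass
class Segment:
    start: Point
    end: Point
    tags: list


def to_jsonable(value):
    """Recursively convert a value into JSON-compatible data with type tags."""
    if is_dataclass(value) and not isinstance(value, type):
        return {
            "__type__": type(value).__name__,
            "fields": {f.name: to_jsonable(getattr(value, f.name)) for f in fields(value)},
        }
    if isinstance(value, (list, tuple)):
        return {"__type__": type(value).__name__, "items": [to_jsonable(v) for v in value]}
    if isinstance(value, dict):
        pairs = [[to_jsonable(k), to_jsonable(v)] for k, v in value.items()]
        return {"__type__": "dict", "items": pairs}
    return value  # assumed to be a JSON primitive: int, float, str, bool, or None


def from_jsonable(data):
    """Invert to_jsonable, rebuilding containers and registered dataclasses."""
    if isinstance(data, dict) and "__type__" in data:
        name = data["__type__"]
        if name == "list":
            return [from_jsonable(v) for v in data["items"]]
        if name == "tuple":
            return tuple(from_jsonable(v) for v in data["items"])
        if name == "dict":
            return {from_jsonable(k): from_jsonable(v) for k, v in data["items"]}
        cls = REGISTRY[name]  # raises KeyError for unregistered types
        return cls(**{k: from_jsonable(v) for k, v in data["fields"].items()})
    return data


# Round trip: object -> JSON text -> object.
segment = Segment(Point(0, 1), Point(2, 3), ["edge", (1, 2)])
text = json.dumps(to_jsonable(segment))
restored = from_jsonable(json.loads(text))
```

Tagging every container with its type keeps tuples distinct from lists after the round trip, which plain `json.dumps` would not do. Types that are not dataclasses would need their own handling.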

Findings

1. Existing benchmarks mostly contain low-complexity problems.
2. The dataset captures diverse real-world code complexities.
3. Categorization reveals current benchmarks underestimate real-world difficulty.

Abstract

Code reasoning tasks are becoming prevalent in large language model (LLM) assessments. Yet few studies examine the impact of real-world complexities on code reasoning, e.g., inter- or intra-procedural dependencies, API calls, deeply nested constructs, and non-primitive complex types. Evaluating LLMs only in simplistic settings that lack these complexities poses a significant threat to assumptions about their generalizability in practice. To enable a more realistic evaluation of code reasoning, we construct a dataset of 1200 reasoning problems from two sources: existing code reasoning benchmarks and popular GitHub Python repositories. Our pipeline leverages static and dynamic program analysis to automatically serialize/deserialize the compound, complex, and custom types that abound in real-world code, going far beyond the primitive types used in prior studies. A key feature of our dataset is categorizing…
