Evaluating LLMs Code Reasoning Under Real-World Context

Changshu Liu

arXiv:2604.12881·cs.SE·April 15, 2026

TL;DR

This paper introduces R2Eval, a benchmark of 135 code reasoning problems drawn from ten widely used Python projects, built to reflect the data structures and dependencies of real-world code when evaluating LLMs.

Contribution

It presents R2Eval, a benchmark that captures real-world data complexity and project dependencies, improving the assessment of LLMs' code reasoning abilities.

Findings

01

R2Eval includes 135 problems from ten Python projects.

02

It serializes complex and custom data types for realistic evaluation.

03

By preserving real-world data complexity, the benchmark enables a more realistic assessment of LLMs' practical code reasoning.

Abstract

Code reasoning tasks are increasingly crucial to evaluating large language models (LLMs). Yet most existing benchmarks rely on simplistic, LLM-generated snippets or human-written solutions to code challenges and often restrict inputs and outputs to primitive types, failing to reflect the structure and dependencies of real-world projects. These simplifications limit their ability to measure practical generalizability. We present R2Eval, a benchmark of 135 code reasoning problems drawn from ten widely used Python projects. Unlike prior work, R2Eval serializes compound and custom types, preserving real-world data complexity and enabling a more realistic assessment of LLMs.
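To make the idea of serializing compound and custom types concrete, here is a minimal sketch of how a custom object might be turned into text that can be placed in a prompt. The abstract does not describe R2Eval's actual serialization format, so everything below is an assumption for illustration: the `Version` and `Package` classes, the `__type__` tagging convention, and the `to_serializable` and `serialize` helpers are all hypothetical.

```python
import json
from dataclasses import dataclass, field, fields, is_dataclass


# Hypothetical custom types standing in for classes found in a real project.
@dataclass
class Version:
    major: int
    minor: int


@dataclass
class Package:
    name: str
    version: Version
    tags: set = field(default_factory=set)


def to_serializable(obj):
    """Recursively convert compound and custom values into JSON-friendly data.

    Assumed convention (not taken from the paper): custom objects and sets are
    tagged with a "__type__" key so the original type stays visible.
    """
    if is_dataclass(obj) and not isinstance(obj, type):
        return {
            "__type__": type(obj).__name__,
            "fields": {
                f.name: to_serializable(getattr(obj, f.name)) for f in fields(obj)
            },
        }
    if isinstance(obj, dict):
        return {str(k): to_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (set, frozenset)):
        # Sorting gives a deterministic order; this assumes the elements are
        # mutually comparable primitives such as strings or ints.
        return {"__type__": "set", "items": sorted(to_serializable(v) for v in obj)}
    if isinstance(obj, (list, tuple)):
        return [to_serializable(v) for v in obj]
    return obj  # primitives pass through unchanged


def serialize(obj):
    """Return a deterministic JSON string for a (possibly nested) value."""
    return json.dumps(to_serializable(obj), sort_keys=True)


pkg = Package(name="demo", version=Version(1, 2), tags={"b", "a"})
print(serialize(pkg))
```

A benchmark restricted to primitive inputs and outputs would never need this step. Once a function takes or returns a nested object like `Package`, some such representation is needed before a model can be asked to reason about the value.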
