Evaluating LLMs Code Reasoning Under Real-World Context
Changshu Liu

TL;DR
This paper introduces R2Eval, a benchmark of 135 code reasoning problems drawn from real-world Python projects, built to reflect the data structures and project dependencies that LLMs encounter in practice.
Contribution
It presents R2Eval, a benchmark that captures real-world data complexity and project dependencies, improving the assessment of LLMs' code reasoning abilities.
Findings
R2Eval includes 135 problems from ten Python projects.
It serializes complex and custom data types for realistic evaluation.
The benchmark enables a more realistic assessment of LLMs' code reasoning on real-world code.
Abstract
Code reasoning tasks are increasingly crucial to evaluating large language models (LLMs). Yet most existing benchmarks rely on simplistic, LLM-generated snippets or human-written solutions to code challenges and often restrict inputs and outputs to primitive types, failing to reflect the structure and dependencies of real-world projects. These simplifications limit their ability to measure practical generalizability. We present R2Eval, a benchmark of 135 code reasoning problems drawn from ten widely used Python projects. Unlike prior work, R2Eval serializes compound and custom types, preserving real-world data complexity and enabling a more realistic assessment of LLMs.
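To make the idea of serializing compound and custom types concrete, here is a minimal sketch in Python. It is not R2Eval's actual serialization format, which the abstract does not describe. The `serialize` helper, the `__type__` tag, and the `Version` class are all hypothetical names chosen for illustration. The sketch only shows how a custom object nested inside a container might be turned into plain text that can be placed in a prompt.

```python
import json
from dataclasses import dataclass


@dataclass
class Version:
    """Hypothetical custom type standing in for a project-defined class."""
    major: int
    minor: int


def serialize(value):
    """Recursively convert a value into JSON-compatible data.

    Assumed convention (not taken from the paper): non-primitive values
    become dicts tagged with "__type__" so the original type stays visible.
    """
    if value is None or isinstance(value, (bool, int, float, str)):
        return value  # primitives pass through unchanged
    if isinstance(value, (list, tuple)):
        return {
            "__type__": type(value).__name__,
            "items": [serialize(v) for v in value],
        }
    if isinstance(value, dict):
        # Store pairs as a list so non-string keys survive serialization.
        return {
            "__type__": "dict",
            "items": [[serialize(k), serialize(v)] for k, v in value.items()],
        }
    if hasattr(value, "__dict__"):
        # Custom objects: record the class name and instance attributes.
        return {
            "__type__": type(value).__name__,
            "fields": {k: serialize(v) for k, v in vars(value).items()},
        }
    # Fallback for anything else: keep the type name and a textual repr.
    return {"__type__": type(value).__name__, "repr": repr(value)}


example_input = {"release": Version(1, 2), "tags": ("stable", "lts")}
serialized = serialize(example_input)
print(json.dumps(serialized, indent=2))
```

Tagging each value with its type name is one design choice among several. It keeps the distinction between, say, a tuple and a list, which a plain JSON dump would lose.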
