CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction

Jun Gao; Yun Peng; Qian Qiao; Changhai Zhou; Yuhua Zhou; Shiyang Zhang; Shichao Weng; Zhenchang Xing; Xiaoxue Ren

arXiv:2604.25399·cs.SE·April 29, 2026

TL;DR

CoRE is a new benchmark designed to evaluate the true reasoning capabilities of large language models in code understanding, focusing on implementation invariance and process transparency beyond just output correctness.

Contribution

The paper introduces a comprehensive code reasoning benchmark that exposes two limitations of current LLMs: weak robustness across functionally equivalent implementations and weak understanding of intermediate execution states.

Findings

01 Models show significant performance variation across equivalent implementations.

02 Models often produce correct outputs without understanding intermediate states.

03 Output-only evaluations are insufficient for assessing code reasoning.
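To make findings 02 and 03 concrete, here is a minimal sketch of how an output-only check can pass while a check on intermediate states fails. It is an illustration only, not the benchmark's actual task format or scoring: the function names `run_with_trace` and `score_prediction` and the model answer are hypothetical.

```python
def run_with_trace(values):
    """Sum the values and record the accumulator after every iteration."""
    total = 0
    trace = []
    for v in values:
        total += v
        trace.append(total)  # intermediate state after processing v
    return total, trace


def score_prediction(values, predicted_output, predicted_trace):
    """Compare a predicted output and a predicted trace with real execution."""
    actual_output, actual_trace = run_with_trace(values)
    return {
        "output_correct": predicted_output == actual_output,
        "trace_correct": predicted_trace == actual_trace,
    }


# Hypothetical model answer: right final value, wrong intermediate state.
result = score_prediction([3, -1, 4], predicted_output=6, predicted_trace=[3, 4, 6])
print(result)  # {'output_correct': True, 'trace_correct': False}
```

An evaluation that looked only at `output_correct` would count this answer as a success, even though the predicted state after the second iteration (4) does not match the executed one (2).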

Abstract

Despite strong performance on code generation tasks, it remains unclear whether large language models (LLMs) genuinely reason about code execution. Existing code reasoning benchmarks primarily evaluate final output correctness under a single canonical implementation, leaving two critical aspects underexplored: (1) whether LLMs remain consistent across functionally equivalent implementations, and (2) whether LLMs can accurately reason about intermediate execution states. We introduce CoRE, a Code Reasoning benchmark that evaluates code reasoning through implementation invariance and process transparency. Extensive evaluations on eight frontier LLMs reveal two fundamental limitations. First, models exhibit a substantial robustness gap, with performance varying significantly across equivalent implementations. Second, we observe…
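As a rough illustration of implementation invariance, the sketch below defines two functionally equivalent implementations of one task and measures how often a model is correct on at least one variant but not on all of them. Everything here is an assumption made for illustration: the helper `invariance_gap`, its definition, and the per-variant correctness table are hypothetical and are not taken from the paper, which may define its robustness gap differently.

```python
def sum_evens_loop(numbers):
    """Variant A: explicit loop with an accumulator."""
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n
    return total


def sum_evens_builtin(numbers):
    """Variant B: generator expression passed to the built-in sum."""
    return sum(n for n in numbers if n % 2 == 0)


VARIANTS = [sum_evens_loop, sum_evens_builtin]


def equivalent_on(inputs):
    """True if every variant returns the same value on every input."""
    return all(len({f(x) for f in VARIANTS}) == 1 for x in inputs)


def invariance_gap(per_variant_correct):
    """Hypothetical metric: share of problems solved on at least one variant
    minus share solved on all variants. Each row holds one problem's
    correctness flags, one flag per equivalent implementation."""
    n = len(per_variant_correct)
    solved_any = sum(any(row) for row in per_variant_correct) / n
    solved_all = sum(all(row) for row in per_variant_correct) / n
    return solved_any - solved_all


# Made-up correctness flags for four problems with two variants each.
rows = [[True, True], [True, False], [False, False], [True, True]]
print(equivalent_on([[1, 2, 3, 4], [], [-2, 5]]))  # True
print(invariance_gap(rows))  # 0.25
```

A model that reasons about behavior rather than surface form should drive this gap toward zero, because its answer would not depend on which equivalent variant it was shown.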

Code & Models

Repositories

ZJUSig/CoRE (GitHub)