CoreCodeBench: Decoupling Code Intelligence via Fine-Grained Repository-Level Tasks

Lingyue Fu; Hao Guan; Bolun Zhang; Haowei Yuan; Yaoming Zhu; Jun Xu; Zongyu Wang; Lin Qiu; Xunliang Cai; Xuezhi Cao; Weiwen Liu; Weinan Zhang; Yong Yu

arXiv:2507.05281 · cs.SE · January 8, 2026



TL;DR

CoreCodeBench is a novel, fine-grained, repository-level benchmark for evaluating code intelligence in LLMs, enabling detailed diagnosis of specific cognitive skills and addressing limitations of existing coarse-grained static benchmarks.

Contribution

The paper introduces CoreCodeBench and the automated framework CorePipe, which together provide a configurable, high-quality suite of atomized tasks that isolates individual cognitive demands and supports difficulty scaling.

Findings

1. LLMs show significant capability misalignment across cognitive dimensions.

2. Fine-grained evaluation reveals strengths and weaknesses not visible in coarse metrics.

3. CoreCodeBench achieves higher data validity and robustness compared to existing benchmarks.

Abstract

The evaluation of Large Language Models (LLMs) for software engineering has shifted towards complex, repository-level tasks. However, existing benchmarks predominantly rely on coarse-grained pass rates that treat programming proficiency as a monolithic capability, obscuring specific cognitive bottlenecks. Furthermore, the static nature of these benchmarks renders them vulnerable to data contamination and performance saturation. To address these limitations, we introduce CoreCodeBench, a configurable repository-level benchmark designed to dissect coding capabilities through atomized tasks. Leveraging our automated framework, CorePipe, we extract and transform Python repositories into a comprehensive suite of tasks that isolate distinct cognitive demands within identical code contexts. Unlike static evaluations, CoreCodeBench supports controllable difficulty scaling to prevent saturation…
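The contrast between a coarse pass rate and a per-dimension breakdown can be illustrated with a minimal sketch. This is not CoreCodeBench's scoring code: the function `pass_rates`, the dimension labels `skill_a` and `skill_b`, and the outcome data are all hypothetical and chosen only to show how two models with the same overall pass rate can have very different profiles.

```python
from collections import defaultdict


def pass_rates(results):
    """Return (overall pass rate, per-dimension pass rates).

    `results` is a list of (dimension, passed) pairs. The dimension
    labels are hypothetical placeholders, not the benchmark's task types.
    """
    if not results:
        raise ValueError("results must not be empty")
    totals = defaultdict(int)
    passes = defaultdict(int)
    for dimension, passed in results:
        totals[dimension] += 1
        passes[dimension] += int(passed)
    overall = sum(passes.values()) / sum(totals.values())
    per_dimension = {d: passes[d] / totals[d] for d in totals}
    return overall, per_dimension


# Made-up outcomes: model A passes every skill_a task and no skill_b task,
# while model B passes half of each.
model_a = [("skill_a", True)] * 4 + [("skill_b", False)] * 4
model_b = ([("skill_a", True), ("skill_a", False)] * 2
           + [("skill_b", True), ("skill_b", False)] * 2)

overall_a, per_a = pass_rates(model_a)
overall_b, per_b = pass_rates(model_b)

print(overall_a, per_a)
print(overall_b, per_b)
```

Under these assumed outcomes the single aggregate number cannot distinguish the two models, whereas the per-dimension view does, which is the kind of bottleneck the abstract says coarse-grained pass rates obscure.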


Code & Models

Repositories

agi-eval-official/corecodebench (official)


Taxonomy

Topics: Software Engineering Research · Machine Learning and Data Classification · Software System Performance and Reliability
