CoreCodeBench: Decoupling Code Intelligence via Fine-Grained Repository-Level Tasks
Lingyue Fu, Hao Guan, Bolun Zhang, Haowei Yuan, Yaoming Zhu, Jun Xu, Zongyu Wang, Lin Qiu, Xunliang Cai, Xuezhi Cao, Weiwen Liu, Weinan Zhang, Yong Yu

TL;DR
CoreCodeBench is a novel, fine-grained, repository-level benchmark for evaluating code intelligence in LLMs, enabling detailed diagnosis of specific cognitive skills and addressing limitations of existing coarse-grained static benchmarks.
Contribution
The paper introduces CoreCodeBench and CorePipe, the automated framework used to build it. Together they provide a configurable, high-quality suite of atomized tasks that isolates distinct cognitive demands and supports difficulty scaling for finer-grained model evaluation.
Findings
LLMs show significant capability misalignment across cognitive dimensions.
Fine-grained evaluation reveals strengths and weaknesses not visible in coarse metrics.
CoreCodeBench achieves higher data validity and robustness than existing benchmarks.
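The gap between coarse and fine-grained metrics can be illustrated with a minimal sketch. Everything here is hypothetical: the task-type labels `type_a` and `type_b`, the toy outcomes, and both helper functions are invented for illustration and are not taken from CoreCodeBench or CorePipe. The sketch only shows how a single aggregate pass rate can hide a large difference between task types.

```python
from collections import defaultdict

# Hypothetical toy results as (task_type, passed) pairs.
# Labels and outcomes are invented; they are not CoreCodeBench data.
results = [
    ("type_a", True), ("type_a", True), ("type_a", True), ("type_a", False),
    ("type_b", True), ("type_b", False), ("type_b", False), ("type_b", False),
]

def overall_pass_rate(results):
    """Coarse metric: fraction of all tasks passed."""
    return sum(passed for _, passed in results) / len(results)

def pass_rate_by_type(results):
    """Fine-grained metric: pass rate computed separately per task type."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for task_type, passed in results:
        totals[task_type] += 1
        passes[task_type] += passed
    return {t: passes[t] / totals[t] for t in totals}

print(overall_pass_rate(results))   # 0.5
print(pass_rate_by_type(results))   # {'type_a': 0.75, 'type_b': 0.25}
```

In this toy data the aggregate rate of 0.5 says nothing about the fact that one task type is passed three times as often as the other, which is the kind of information a per-dimension breakdown exposes.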
Abstract
The evaluation of Large Language Models (LLMs) for software engineering has shifted towards complex, repository-level tasks. However, existing benchmarks predominantly rely on coarse-grained pass rates that treat programming proficiency as a monolithic capability, obscuring specific cognitive bottlenecks. Furthermore, the static nature of these benchmarks renders them vulnerable to data contamination and performance saturation. To address these limitations, we introduce CoreCodeBench, a configurable repository-level benchmark designed to dissect coding capabilities through atomized tasks. Leveraging our automated framework, CorePipe, we extract and transform Python repositories into a comprehensive suite of tasks that isolate distinct cognitive demands within identical code contexts. Unlike static evaluations, CoreCodeBench supports controllable difficulty scaling to prevent saturation…
Taxonomy
Topics: Software Engineering Research · Machine Learning and Data Classification · Software System Performance and Reliability
