CoRe: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks
Danning Xie, Mingwei Zheng, Xuwei Liu, Jiannan Wang, Chengpeng Wang, Lin Tan, Xiangyu Zhang

TL;DR
This paper introduces CORE, a benchmark for evaluating large language models' ability to perform static analysis tasks in code, revealing strengths and weaknesses in semantic reasoning across multiple programming languages.
Contribution
The paper presents a new, human-verified benchmark with diverse static analysis tasks and a sampling strategy to evaluate LLMs' code reasoning capabilities across different languages.
Findings
LLMs excel at dependency identification but struggle with deep semantic understanding.
Models face challenges with complex control structures and backward dependencies.
The benchmark reveals specific areas for improving LLMs' reasoning in code analysis.
Abstract
Large language models (LLMs) have been widely adopted across diverse domains of software engineering, such as code generation, program repair, and vulnerability detection. These applications require understanding beyond surface-level code patterns: value propagation, control flow, and interdependence between program elements. However, existing benchmarks primarily evaluate end-to-end outcomes, such as whether code is correctly repaired or generated, leaving the models' ability for program semantic reasoning underexplored. This work presents CORE, a high-quality, human-verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CORE includes 12,553 task instances spanning data dependency, control dependency, and information flow across programs written in C/C++, Java, and Python. To ensure semantic diversity and reasoning complexity, we propose a semantics-aware…
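To make the abstract's task categories concrete, here is a minimal sketch of what data-dependency and control-dependency questions look like on a toy program. It is not CORE's task format or tooling. The function names (`direct_data_deps`, `depends_on`, `control_deps`) and the snippet are hypothetical. The analysis is deliberately simplified: it ignores statement order and handles only plain assignments and directly nested `if`/`while` bodies.

```python
import ast

# Toy program (hypothetical, not taken from the benchmark).
SNIPPET = (
    "x = 1\n"          # line 1
    "y = x + 2\n"      # line 2
    "if y > 0:\n"      # line 3
    "    z = y * 3\n"  # line 4
    "else:\n"          # line 5
    "    z = 0\n"      # line 6
    "w = 7\n"          # line 7
)

def direct_data_deps(source):
    """Map each assigned variable to the names read on its right-hand sides.

    Simplification: statement order is ignored, so this over-approximates.
    """
    deps = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            reads = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
            for target in node.targets:
                if isinstance(target, ast.Name):
                    deps.setdefault(target.id, set()).update(reads)
    return deps

def depends_on(deps, var, origin):
    """Return True if `var` transitively data-depends on `origin`."""
    seen, stack = set(), [var]
    while stack:
        current = stack.pop()
        for read in deps.get(current, ()):
            if read == origin:
                return True
            if read not in seen:
                seen.add(read)
                stack.append(read)
    return False

def control_deps(source):
    """Map each statement line directly inside an if/while to its condition's line."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.While)):
            for stmt in node.body + node.orelse:
                result[stmt.lineno] = node.lineno
    return result

deps = direct_data_deps(SNIPPET)
print(depends_on(deps, "z", "x"))  # z reads y, and y reads x
print(depends_on(deps, "w", "x"))  # w is assigned a constant
print(control_deps(SNIPPET))       # lines 4 and 6 are governed by the condition on line 3
```

In this sketch, `z` data-depends on `x` through `y`, while both assignments to `z` are control-dependent on the branch at line 3. An information-flow question would combine the two kinds of edges, for example asking whether the value of `x` can influence `z` at all.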
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software System Performance and Reliability
