CoRe: Benchmarking LLMs' Code Reasoning Capabilities through Static Analysis Tasks

Danning Xie; Mingwei Zheng; Xuwei Liu; Jiannan Wang; Chengpeng Wang; Lin Tan; Xiangyu Zhang

arXiv:2507.05269·cs.SE·January 21, 2026

TL;DR

This paper introduces CoRe, a benchmark that evaluates how well large language models perform static analysis tasks on code, revealing strengths and weaknesses in semantic reasoning across multiple programming languages.

Contribution

The paper presents a new, human-verified benchmark with diverse static analysis tasks and a sampling strategy to evaluate LLMs' code reasoning capabilities across different languages.

Findings

01. LLMs excel at dependency identification but struggle with deep semantic understanding.

02. Models face challenges with complex control structures and backward dependencies.

03. The benchmark reveals specific areas for improving LLMs' reasoning in code analysis.

Abstract

Large language models (LLMs) have been widely adopted across diverse domains of software engineering, such as code generation, program repair, and vulnerability detection. These applications require understanding beyond surface-level code patterns: value propagation, control flow, and interdependence between program elements. However, existing benchmarks primarily evaluate end-to-end outcomes, such as whether code is correctly repaired or generated, leaving the models' ability for program semantic reasoning underexplored. This work presents CoRe, a high-quality, human-verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CoRe includes 12,553 task instances spanning data dependency, control dependency, and information flow across programs written in C/C++, Java, and Python. To ensure semantic diversity and reasoning complexity, we propose a semantics-aware…
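
To make the terms "data dependency" and "control dependency" concrete, here is a minimal Python sketch using the standard `ast` module. It is only an illustration of the two concepts, not the benchmark's task format or the authors' methodology. The function names `data_dependencies` and `control_dependencies` are hypothetical, and the sketch assumes simple structured code: straight-line assignments for data dependencies, and plain `if`/`while` bodies without `break`, `continue`, or `return` for control dependencies.

```python
import ast


def data_dependencies(source):
    """Return (def_line, use_line, var) triples for straight-line code.

    Assumption: `source` contains only top-level simple statements,
    so the most recent assignment to a variable is the one that
    reaches each later read.
    """
    last_def = {}  # variable name -> line of its most recent assignment
    deps = set()
    for stmt in ast.parse(source).body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Handle reads first, so `a = a + 1` depends on the previous `a`.
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                deps.add((last_def[n.id], stmt.lineno, n.id))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = stmt.lineno
    return deps


def control_dependencies(source):
    """Return (condition_line, statement_line) pairs.

    Assumption: a statement directly nested in an if/while body
    (or else branch) executes only depending on that condition.
    """
    deps = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.While)):
            for child in node.body + node.orelse:
                deps.add((node.lineno, child.lineno))
    return deps


STRAIGHT = "a = 1\nb = a + 2\na = b * 3\nc = a + b\n"
BRANCHY = "x = 5\nif x > 0:\n    y = 1\nelse:\n    y = 2\nprint(y)\n"

print(sorted(data_dependencies(STRAIGHT)))
print(sorted(control_dependencies(BRANCHY)))
```

In `STRAIGHT`, line 4 reads the value of `a` assigned on line 3, not the one on line 1, because line 3 overwrites it. In `BRANCHY`, the assignments on lines 3 and 5 depend on the condition on line 2, while the `print` on line 6 does not.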

Code & Models

Datasets

lt-asset/CoRe
dataset· 30 dl

Videos

CoRe: Benchmarking LLMs’ Code Reasoning Capabilities through Static Analysis Tasks · SlidesLive

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software System Performance and Reliability