CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning
Monoshi Kumar Roy, Simin Chen, Benjamin Steenhoek, Jinjun Peng, Gail Kaiser, Baishakhi Ray, Wei Le

TL;DR
CodeSense introduces a comprehensive benchmark and dataset for fine-grained code semantic reasoning on real-world software projects, revealing current LLM limitations and providing tools for future research in software engineering tasks.
Contribution
It presents the first real-world code reasoning benchmark with a dataset from actual repositories, along with an execution tracing framework for future SE research.
Findings
State-of-the-art LLMs struggle with fine-grained code reasoning.
Prompting techniques improve but do not fully bridge the reasoning gap.
The dataset and tools facilitate future development of SE-focused models.
Abstract
Understanding and reasoning about code semantics is essential for enhancing code LLMs' abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or educational coding problems and focus on coarse-grained reasoning tasks such as input/output prediction, limiting their effectiveness in evaluating LLMs in practical SE contexts. To bridge this gap, we propose CodeSense, the first benchmark that makes available a spectrum of fine-grained code reasoning tasks concerned with the software engineering of real-world code. We collected Python, C and Java software projects from real-world repositories. We executed tests from these repositories, collected their execution traces, and constructed a ground truth dataset for fine-grained semantic reasoning tasks. We then performed comprehensive evaluations on…
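The tracing step described in the abstract (run a test, record what the program did, turn the record into ground truth) can be illustrated with a minimal Python sketch. This is not the CodeSense framework: it relies only on the standard-library `sys.settrace` hook, and the names `trace_function`, `value_prediction_items`, and `clamp_sum` are hypothetical, introduced here for illustration.

```python
import sys


def trace_function(func, *args, **kwargs):
    """Run func and record a snapshot of its locals before each line executes.

    Hypothetical helper: a toy stand-in for an execution tracer.
    """
    records = []
    target_code = func.__code__

    def tracer(frame, event, arg):
        # Ignore frames that do not belong to the function under study.
        if frame.f_code is not target_code:
            return None
        if event == "line":
            offset = frame.f_lineno - target_code.co_firstlineno
            records.append((offset, dict(frame.f_locals)))
        elif event == "return":
            records.append(("return", arg))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, records


def value_prediction_items(records, variable):
    """Turn a trace into (step, expected value) pairs for one variable."""
    items = []
    for step, (where, snapshot) in enumerate(records):
        if where != "return" and variable in snapshot:
            items.append({
                "step": step,
                "line_offset": where,
                "variable": variable,
                "expected": snapshot[variable],
            })
    return items


def clamp_sum(values, limit):
    # Small function used only as an example tracing target.
    total = 0
    for v in values:
        total += v
        if total > limit:
            total = limit
    return total


result, records = trace_function(clamp_sum, [3, 4, 5], 10)
items = value_prediction_items(records, "total")
```

Each entry in `items` pairs a point in the execution with the concrete value a variable held there, which is the kind of fine-grained question ("what is `total` at this statement?") that a model's prediction could be checked against. Real projects would also need handling for exceptions, nested calls, and non-serializable values, which this sketch omits.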
Peer Reviews
Decision: ICLR 2026 Poster
- The benchmark is interesting, addressing the weaknesses of previous benchmarks (CruxEval, REval, CodeMind).
- The authors study a diverse array of code reasoning and program analysis tasks. RQ3 and RQ5 are particularly interesting and have not been studied before (to my knowledge). Many other RQs are also studied from a fresh perspective, and there is some analysis for each one (some extended in the Appendix).
- A wide variety of models are evaluated and studied, and there is a lot of room for
- The models studied in the paper are generally weaker than frontier models, even taking into account the lag between the review date and the ICLR submission deadline. The paper would be stronger if it drew insights from the failure modes of today's frontier models such as GPT-5 and Gemini-2.5-Pro, as well as open models like DeepSeek-R1 and Qwen3. This would show which of the paper's findings still hold true for the strongest models.
- The research questions are interesting and open u
1. Important problem and motivation: Fine-grained semantic reasoning is crucial for real-world SE tasks like test generation, vulnerability detection, and bug repair. The motivation examples in Figure 1 effectively illustrate this.
2. The paper provides clear presentation and informative figures.
3. Releasing the framework, dataset, and leaderboard supports reproducibility and future work.
1. Missing direct comparison with existing benchmarks: The paper claims CodeSense provides advantages over existing benchmarks (Table 1) but provides no experimental validation. Table 1 only compares features (Real-world Projects, Fine-grained Reasoning) without demonstrating whether these features translate to better evaluation quality. The 14 models should be evaluated on both CodeSense and existing benchmarks (CruxEval, REval, CodeMind) to show: (1) whether CodeSense reveals insights that othe
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Software Engineering Techniques and Practices
Methods: Focus · Sparse Evolutionary Training
