Reasoning Runtime Behavior of a Program with LLM: How Far Are We?
Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, Xin Xia

TL;DR
This paper introduces REval, a framework for evaluating how well code LLMs reason about a program's runtime behavior and how logically consistent that reasoning is. It finds that current models perform poorly on both counts and need improvement.
Contribution
It proposes REval, a novel framework for assessing code reasoning and consistency in code LLMs, and provides a large-scale empirical study using this framework.
Findings
Most LLMs perform poorly on runtime behavior reasoning, with an average accuracy of only 44.4%.
Average Incremental Consistency score is only 10.3.
Current code LLMs need significant improvement in reasoning capabilities.
Abstract
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and ClassEval). Code reasoning is one of the most essential abilities of code LLMs, but existing benchmarks for code reasoning are not sufficient. Typically, they focus on predicting the input and output of a program, ignoring the evaluation of the intermediate behavior during program execution, as well as the logical consistency (e.g., the model should not give the correct output if the prediction of execution path is wrong) when performing the reasoning. To address these problems, in this paper, we propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution. We utilize existing code benchmarks…
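The consistency condition mentioned in the abstract (a correct output should not accompany a wrong execution path) can be illustrated with a small sketch. This is not REval's implementation or its scoring formula. The helper names `trace_lines`, `classify`, and `is_consistent` are hypothetical, and the check covers only the single case the abstract names.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args); return (result, executed line offsets from the def line)."""
    executed = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Record only "line" events that belong to the function under test.
        if frame.f_code is code and event == "line":
            executed.append(frame.f_lineno - code.co_firstlineno)
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always remove the tracer, even on error
    return result, executed


def classify(x):
    if x < 0:
        return "negative"
    return "non-negative"


def is_consistent(pred_path, pred_output, true_path, true_output):
    """Hypothetical check: reject a correct output paired with a wrong path."""
    path_ok = pred_path == true_path
    output_ok = pred_output == true_output
    return not (output_ok and not path_ok)


# Ground truth comes from actually executing the program.
true_output, true_path = trace_lines(classify, 5)

# A made-up model prediction: right answer, wrong branch.
print(is_consistent([1, 2], "non-negative", true_path, true_output))
```

Here the ground-truth path comes from real execution, and the model's predicted path and output are compared against it. A prediction that gets the output right while naming the wrong branch is flagged as inconsistent.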
Taxonomy
Topics: Software System Performance and Reliability · Software Testing and Debugging Techniques · Advanced Data Processing Techniques
