StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
Hao Wang, Rui Li, Lei Sha, Jie M. Zhang

TL;DR
StepCodeReasoner supervises intermediate execution states explicitly and pairs that supervision with a reinforcement learning algorithm, Bi-Level GRPO, so that code reasoning is modeled as stepwise execution tracing rather than final-answer prediction alone.
Contribution
The paper presents a framework that supervises intermediate execution states and a bi-level RL method (Bi-Level GRPO) for structured credit assignment, and it reports state-of-the-art results.
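To make the idea of credit assignment at two levels more concrete, here is a minimal Python sketch. The first function is the group-relative normalization commonly used in GRPO (each trajectory's reward is standardized against the other trajectories sampled for the same prompt). The second function is purely hypothetical: the weighting rule, the `discount` parameter, and the names `group_relative_advantages` and `step_credits` are assumptions made for illustration, since this summary does not specify how Bi-Level GRPO computes its intra-trajectory rewards.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each trajectory's reward against its sampling group.

    This is the usual GRPO-style normalization; using the population
    standard deviation and this eps value is an assumption of the sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_credits(step_correct, final_correct, discount=0.5):
    """Hypothetical per-step credit inside one trajectory.

    A correctly predicted intermediate state earns full credit only when
    the final answer is also correct; otherwise it is scaled by `discount`.
    Incorrect steps earn nothing. This rule is illustrative only and is
    NOT taken from the paper.
    """
    scale = 1.0 if final_correct else discount
    return [scale if ok else 0.0 for ok in step_correct]


# Four sampled trajectories for one prompt, with binary final rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# One trajectory with three anchors: two right, one wrong, final answer wrong.
credits = step_credits([True, True, False], final_correct=False)
```

In this sketch the two successful trajectories receive positive advantages and the two failures negative ones, while the per-step credits separate steps that were right from steps that were wrong within a single trajectory.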
Findings
Achieved 91.1% on CRUXEval with a 7B model.
Outperformed baselines on LiveCodeBench and REval benchmarks.
Improved code generation performance through explicit execution modeling.
Abstract
Existing code reasoning methods primarily supervise final code outputs, ignoring intermediate states, often leading to reward hacking where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit intermediate execution-state supervision. By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment at two levels: inter-trajectory, comparing alternative execution paths, and intra-trajectory, rewarding intermediate accuracy based on its impact on downstream correctness. Extensive experiments demonstrate that…
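As a rough illustration of the print-based execution-trace anchors described in the abstract, the Python sketch below inserts a print statement after each simple assignment, runs the instrumented program, and collects the emitted states so that a predicted trace can be checked step by step. The `[TRACE]` line format and the helper names `add_trace_anchors`, `trace_states`, and `step_accuracy` are assumptions made for this example; the paper's actual anchor format and insertion rules are not given in this summary.

```python
import ast
import contextlib
import io


def _instrument(stmts):
    """Return stmts with a print anchor after each assignment to a plain name."""
    out = []
    for stmt in stmts:
        # Recurse into nested blocks (function bodies, loops, branches).
        for field in ("body", "orelse", "finalbody"):
            block = getattr(stmt, field, None)
            if isinstance(block, list) and block:
                setattr(stmt, field, _instrument(block))
        out.append(stmt)
        if isinstance(stmt, ast.Assign):
            targets = stmt.targets
        elif isinstance(stmt, ast.AugAssign):
            targets = [stmt.target]
        else:
            targets = []
        for t in targets:
            if isinstance(t, ast.Name):
                # Hypothetical anchor format: "[TRACE] L<line> <name> = <repr>"
                anchor_src = f'print("[TRACE] L{stmt.lineno} {t.id} =", repr({t.id}))'
                out.append(ast.parse(anchor_src).body[0])
    return out


def add_trace_anchors(source):
    """Insert print-based anchors into source code (requires Python 3.9+)."""
    tree = ast.parse(source)
    tree.body = _instrument(tree.body)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def trace_states(source):
    """Run the instrumented program and return its anchor lines in order."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(add_trace_anchors(source), "<traced>", "exec"), {})
    return [ln for ln in buffer.getvalue().splitlines() if ln.startswith("[TRACE]")]


def step_accuracy(predicted, actual):
    """Fraction of ground-truth anchor lines that the prediction matches exactly."""
    if not actual:
        return 0.0
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


SAMPLE = """
def f(xs):
    total = 0
    for x in xs:
        total += x * 2
    return total

result = f([1, 2, 3])
"""

ground_truth = trace_states(SAMPLE)
```

Here `ground_truth` holds one line per executed anchor, and a model's predicted trace could be scored against it with `step_accuracy(predicted, ground_truth)`. Exact string matching is the simplest possible check and is only one way such stepwise verification might be done.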
