TL;DR
This paper introduces TraceLift, a training framework that improves reasoning in language models by using executor-grounded rewards to score intermediate reasoning traces on their quality and on their usefulness to the model that consumes them.
Contribution
It proposes a novel planner-executor training approach with a reasoning reward model, and introduces a new dataset for training and evaluating reasoning quality.
Findings
Executor-grounded rewards outperform execution-only training.
The approach improves reasoning quality in math and code benchmarks.
The dataset enables direct learning of reasoning quality.
Abstract
Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured…
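The data flow described in the abstract can be sketched as a single training step. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `planner_executor_step`, the `StepResult` container, and every callable signature are hypothetical. The abstract is cut off before it names the quantity that multiplies the RM score, so that quantity appears here only as an opaque, caller-supplied `second_factor`. How the verifier feedback and the trace reward are combined into a policy update is also not stated, so the sketch returns them separately.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    trace: str              # tagged reasoning emitted by the planner
    artifact: str           # final output produced by the frozen executor
    verifier_reward: float  # outcome feedback on the artifact
    trace_reward: float     # executor-grounded reward on the trace


def planner_executor_step(
    task: str,
    planner: Callable[[str], str],
    executor: Callable[[str, str], str],
    verifier: Callable[[str, str], float],
    rm_score: Callable[[str, str], float],
    second_factor: Callable[[str, str, str], float],
) -> StepResult:
    """One rollout: plan, execute, verify, and score the trace.

    All callables are hypothetical stand-ins. `second_factor` is a
    placeholder for the measured quantity that the truncated abstract
    does not name.
    """
    # The planner is the only component being trained; it emits the trace.
    trace = planner(task)

    # The executor is frozen: it only consumes the trace.
    artifact = executor(task, trace)

    # Verifier feedback is computed on the final artifact.
    verifier_reward = verifier(task, artifact)

    # The trace reward is a product of the rubric-based RM score and a
    # second measured term, per the abstract.
    trace_reward = rm_score(task, trace) * second_factor(task, trace, artifact)

    return StepResult(trace, artifact, verifier_reward, trace_reward)
```

Because the reward is a product, a trace with a high rubric score still earns little if the second term is near zero, and vice versa. That property follows from the multiplicative form alone, whatever the elided factor turns out to be.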
