Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order
Prakhar Gupta, Vaibhav Gupta

TL;DR
This paper explores how injecting a canonical action order as a reward signal during RL post-training can improve solution quality, demonstrated on Zebra puzzles using a Transformer model.
Contribution
It introduces a method to incorporate an ordering reward during RL post-training, enhancing performance without changing the supervised data or model architecture.
Findings
Mixed rewards outperform task-only optimization in RL post-training.
Coarse ordering signals can steer models toward canonical trajectories.
Bootstrapped scaling helps balance reward components at initialization.
Abstract
Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical solver ordering, used only during RL post-training, improves performance even when the model has been fine-tuned on randomized solution sequences. On Zebra puzzles, we fine-tune a Transformer on randomized solution orders, then post-train it with Group Relative Policy Optimization (GRPO) using two rewards: a sparse task reward that is 1 only when the puzzle is fully solved, and an ordering reward that increases when the model's emission order aligns with the canonical solver order. To compare signals cleanly, we combine them via fixed mixtures and use a simple bootstrapped scaling to equalize component magnitudes at initialization. Mixed rewards generally outperform task-only optimization, suggesting that coarse…
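The reward mixing described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact formulas, so the pairwise-concordance form of the ordering reward, the use of the mean reward under the initial policy as the bootstrapped scale, the `floor` guard, and all function names are assumptions made for this example. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO recipe.

```python
from statistics import mean, pstdev


def task_reward(predicted, solution):
    """Sparse task reward: 1.0 only when the puzzle is fully solved."""
    return 1.0 if predicted == solution else 0.0


def ordering_reward(emitted, canonical):
    """ASSUMED form: fraction of emitted pairs that appear in the same
    relative order as in the canonical solver sequence."""
    rank = {cell: i for i, cell in enumerate(canonical)}
    seq = [rank[c] for c in emitted if c in rank]
    pairs = [(a, b) for i, a in enumerate(seq) for b in seq[i + 1:]]
    if not pairs:
        return 0.0
    return sum(a < b for a, b in pairs) / len(pairs)


def bootstrap_scales(task_samples, order_samples, floor=1e-3):
    """ASSUMED scaling: use each component's mean reward under the
    initial policy as its scale, so both start at comparable magnitude.
    `floor` (hypothetical) avoids dividing by ~0 for the sparse reward."""
    return max(mean(task_samples), floor), max(mean(order_samples), floor)


def mixed_reward(r_task, r_order, lam, scales):
    """Fixed mixture of the two scaled components; lam in [0, 1]."""
    s_task, s_order = scales
    return (1.0 - lam) * r_task / s_task + lam * r_order / s_order


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under these assumptions, `lam = 0` recovers task-only optimization (up to a constant scale) and `lam = 1` optimizes ordering alone, so a sweep over `lam` gives the kind of fixed-mixture comparison the abstract describes.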
