TL;DR
This paper introduces Latent Exploration Decoding (LED), a training-free decoding method that restores effective exploration in post-trained large reasoning models, improving pass@1 and pass@16 accuracy and accelerating reward improvement when used inside reinforcement learning.
Contribution
LED is a novel, parameter-free decoding strategy that exploits the higher entropy of intermediate layers to restore exploration in post-trained large reasoning models.
Findings
LED improves pass@1 accuracy by 0.61 percentage points.
LED enhances pass@16 accuracy by 1.03 percentage points.
Integrating LED into reinforcement learning accelerates reward improvement.
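The findings above are stated in terms of pass@1 and pass@16. This summary does not say how those metrics were computed, so the following is only a sketch of the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The function name `pass_at_k` and its arguments are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: this is the standard unbiased estimator; the paper's
    own evaluation protocol may differ.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # contains at least one correct sample.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset holds no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 16 samples per problem, 4 of them correct.
    print(pass_at_k(n=16, c=4, k=1))
    print(pass_at_k(n=16, c=4, k=16))
```

With k = 1 the estimator reduces to the fraction of correct samples, so a 0.61-point gain in pass@1 corresponds to that fraction rising by 0.0061 on average.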
Abstract
Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@n accuracy. Empirically, the final-layer posterior of post-trained LRMs exhibits sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning…
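The aggregation step described in the abstract (cumulative sum of intermediate posteriors, then picking the depth configuration with maximal entropy) can be illustrated with a small NumPy sketch. This is not the authors' implementation, and several details are assumptions: that each layer's hidden state has already been projected to vocabulary logits (the hypothetical `layer_logits` array), that the cumulative sum runs from the final layer backwards so every candidate includes the final-layer posterior, and that each cumulative sum is renormalized before its entropy is measured.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def entropy(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy (in nats) over the last axis."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)


def led_candidate(layer_logits: np.ndarray):
    """Pick the max-entropy cumulative posterior for one decoding step.

    layer_logits: hypothetical array of shape (num_layers, vocab_size),
    ordered from the shallowest to the final layer.
    Returns (distribution, depth_index, entropies), where depth_index d
    means "the deepest d + 1 layers were aggregated".
    """
    posteriors = softmax(layer_logits)
    # Assumption: accumulate starting from the final layer and moving
    # towards shallower layers.
    cumulative = np.cumsum(posteriors[::-1], axis=0)
    # Renormalize each cumulative sum into a probability distribution.
    mixed = cumulative / cumulative.sum(axis=-1, keepdims=True)
    entropies = entropy(mixed)
    best = int(np.argmax(entropies))
    return mixed[best], best, entropies


if __name__ == "__main__":
    # Toy example: a flat shallow layer, a softer middle layer, and a
    # sharply peaked final layer.
    toy_logits = np.array([
        [0.0, 0.0, 0.0, 0.0],
        [2.0, 0.0, 0.0, 0.0],
        [10.0, 0.0, 0.0, 0.0],
    ])
    dist, depth, ents = led_candidate(toy_logits)
    rng = np.random.default_rng(0)
    token = rng.choice(len(dist), p=dist)
    print(depth, ents, token)
```

In the toy input the final layer alone has near-zero entropy, so the selected candidate mixes in the flatter shallow layers. That mirrors the entropy asymmetry the abstract describes, although how the actual method scores and uses its candidates may differ.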
