Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
Akiyoshi Tomihari, Issei Sato

TL;DR
This paper shows that the power distribution, the target distribution of power sampling, links inference-time sampling, self-reward KL-regularized RL, and self-distillation in large language models, so that a distilled model can approach power sampling quality at lower inference cost.
Contribution
It introduces power self-distillation, linking sampling, RL, and distillation via the power distribution, and demonstrates its effectiveness in reasoning tasks.
Findings
The power distribution connects sampling, RL, and distillation.
Power self-distillation can match power sampling performance with lower inference cost.
Power sampling increases self-reward; whether the true reward also improves depends on how well the self-reward aligns with it.
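
Two claims in this summary can be checked on a toy example: that sampling from the power distribution raises the expected self-reward (the sequence-level log-probability under the base model), and that a local approximation that tempers each next-token distribution separately does not reproduce the sequence-level power distribution. The sketch below is illustrative only. The two-token model, its probabilities, the exponent `ALPHA`, and all function names are assumptions made for this example and are not taken from the paper.

```python
import itertools
import math

# Hypothetical toy base model over two-token sequences from vocabulary {0, 1}.
FIRST = {0: 0.6, 1: 0.4}                  # p(x1)
SECOND = {0: {0: 0.5, 1: 0.5},            # p(x2 | x1 = 0)
          1: {0: 0.9, 1: 0.1}}            # p(x2 | x1 = 1)
SEQUENCES = list(itertools.product((0, 1), repeat=2))


def normalize_power(dist, alpha):
    """Raise every probability to the power alpha and renormalize."""
    weights = {k: v ** alpha for k, v in dist.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def base_dist():
    """Sequence-level base distribution p(x1, x2)."""
    return {(a, b): FIRST[a] * SECOND[a][b] for a, b in SEQUENCES}


def power_dist(p, alpha):
    """Sequence-level power distribution, proportional to p(x1, x2) ** alpha."""
    return normalize_power(p, alpha)


def token_level_dist(alpha):
    """Local approximation: temper each conditional on its own, ignoring suffixes."""
    first = normalize_power(FIRST, alpha)
    return {(a, b): first[a] * normalize_power(SECOND[a], alpha)[b]
            for a, b in SEQUENCES}


def expected_self_reward(q, p):
    """Expected base-model log-probability of sequences drawn from q."""
    return sum(q[x] * math.log(p[x]) for x in q)


ALPHA = 2.0  # assumed exponent, chosen only for illustration
p = base_dist()
seq_power = power_dist(p, ALPHA)
tok_power = token_level_dist(ALPHA)

print("most likely sequence, sequence-level power:", max(seq_power, key=seq_power.get))
print("most likely sequence, token-level tempering:", max(tok_power, key=tok_power.get))
print("expected self-reward under base model:", expected_self_reward(p, p))
print("expected self-reward under power dist:", expected_self_reward(seq_power, p))
```

In this toy model the first token 0 is locally more probable, yet the single most probable full sequence starts with token 1, because its continuation is much more concentrated. Token-level tempering cannot see that, which is the suffix-information point in miniature.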
Abstract
Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in large language models (LLMs). At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model's sequence-level log-probabilities are used as the reward. This…
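
The closed-form statement in the abstract follows from the standard solution of a KL-regularized objective. The sketch below uses assumed notation that may differ from the paper's: $p$ is the base model, $\pi$ the trained policy, $\beta > 0$ the KL weight, and $Z(x)$ the normalizer.

```latex
% Assumed notation: p = base model, \pi = policy, \beta > 0 = KL weight.
\begin{align}
J(\pi) &= \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
          - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, p(\cdot \mid x) \big), \\
\pi^{*}(y \mid x) &= \frac{1}{Z(x)}\, p(y \mid x)\, \exp\!\big( r(x, y) / \beta \big), \\
r(x, y) = \log p(y \mid x) \;&\Longrightarrow\;
\pi^{*}(y \mid x) = \frac{p(y \mid x)^{1 + 1/\beta}}{\sum_{y'} p(y' \mid x)^{1 + 1/\beta}} .
\end{align}
```

Under this parametrization the optimizer is a power distribution with exponent $\alpha = 1 + 1/\beta$, so a weaker KL penalty (smaller $\beta$) corresponds to a sharper target.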
