TL;DR
This paper introduces EntRGi, a novel entropy-aware reward guidance method for discrete diffusion language models that improves test-time adaptation and reinforcement learning performance.
Contribution
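EntRGi interpolates, token by token, between continuous token relaxations and sampled hard tokens, weighting the two by the diffusion model's predictive entropy. This preserves both reward model reliability and optimization accuracy, where existing approaches sacrifice one for the other.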
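The interpolation described above can be illustrated with a small sketch. This is not the paper's implementation: the function name `entropy_mixed_embeddings`, the normalization of entropy by log(vocabulary size), and the direction of the blend (more weight on the sampled hard token when entropy is high) are all assumptions made for illustration, since the summary does not specify them.

```python
import numpy as np


def entropy_mixed_embeddings(logits, embedding_table, rng):
    """Blend soft and hard token embeddings per position (hypothetical sketch).

    logits:          (T, V) array of per-position logits from the diffusion model.
    embedding_table: (V, D) array of reward-model input embeddings.
    rng:             numpy random Generator used to sample hard tokens.
    """
    # Numerically stable softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Predictive entropy per position, scaled to [0, 1] by its maximum log(V).
    # ASSUMPTION: this normalization is chosen for illustration only.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    weight = np.clip(entropy / np.log(logits.shape[-1]), 0.0, 1.0)

    # Continuous relaxation: probability-weighted average of embeddings.
    soft = probs @ embedding_table

    # Hard tokens: one sample per position from the predictive distribution.
    tokens = np.array([rng.choice(len(p), p=p) for p in probs])
    hard = embedding_table[tokens]

    # ASSUMPTION: high entropy leans on the hard token, low entropy on the
    # relaxation. The source does not state the direction of the blend.
    mixed = (1.0 - weight)[:, None] * soft + weight[:, None] * hard
    return mixed, weight, tokens


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_logits = np.array([[50.0, 0.0, 0.0, 0.0],   # confident position
                            [0.0, 0.0, 0.0, 0.0]])   # maximally uncertain position
    mixed, weight, tokens = entropy_mixed_embeddings(demo_logits, np.eye(4), rng)
    print(weight)
```

Under these assumptions, a confident position passes through its continuous relaxation almost unchanged, while a maximally uncertain position is represented by the embedding of a single sampled token, so the reward model is fed an input close to a real token embedding in both cases.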
Findings
On 7B-parameter diffusion language models, EntRGi outperforms existing reward guidance approaches.
The improvements appear in both evaluated settings: test-time adaptation and reward-guided reinforcement learning.
The reported gains are consistent across the empirical evaluation.
Abstract
Reward guidance, also known as posterior sampling, is a popular method for test-time adaptation and post-training in continuous diffusion models. In this paper, we study reward guidance for discrete diffusion language models, where one cannot differentiate through the natural outputs of the model because they are discrete tokens. We introduce a novel mechanism called EntRGi (Entropy aware Reward Guidance) to address this issue. EntRGi dynamically interpolates between continuous token relaxations and sampled hard tokens, on a token-by-token basis, using the diffusion model's predictive entropy. We demonstrate that EntRGi maintains both reward model reliability and optimization accuracy, while existing approaches sacrifice one for the other. We empirically validate our approach on 7B-parameter diffusion language models across two settings: (1) test-time adaptation, and (2) RGRL (Reward…
