Learning from Language Feedback via Variational Policy Distillation
Yang Li, Erik Nijkamp, Semih Yavuz, Shafiq Joty

TL;DR
This paper introduces Variational Policy Distillation (VPD), a framework that uses language feedback to actively refine both the teacher and the student policy, improving learning on complex reasoning tasks with sparse rewards.
Contribution
VPD formalizes learning from language feedback as a Variational EM problem, enabling dynamic teacher improvement and overcoming passive distillation limitations.
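For orientation, the generic variational EM bound that this kind of formulation typically rests on can be sketched as follows. This is the textbook evidence lower bound, not the paper's exact objective. All symbols are assumptions introduced here for illustration: x is the prompt, y a response, f the language feedback, O = 1 the event that the response is verified correct, \pi_\theta the student policy, and q_\phi the feedback-conditioned teacher.

```latex
% Generic ELBO with the response y as the latent variable (illustrative notation, not the paper's).
\begin{align}
\log p_\theta(O{=}1 \mid x)
  &= \log \sum_{y} \pi_\theta(y \mid x)\, p(O{=}1 \mid x, y) \\
  &\ge \mathbb{E}_{q_\phi(y \mid x, f)}\!\left[\log p(O{=}1 \mid x, y)\right]
     - \mathrm{KL}\!\left(q_\phi(y \mid x, f) \,\|\, \pi_\theta(y \mid x)\right).
\end{align}
% E-step (teacher): adjust \phi to tighten the bound with \theta held fixed.
% M-step (student): adjust \theta to reduce the KL term with \phi held fixed,
% which amounts to distilling the teacher into the student.
```

Under this reading, a fixed teacher corresponds to never running the E-step, so the bound stops tightening once the student matches the teacher.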
Findings
VPD outperforms standard RLVR and self-distillation baselines across multiple tasks.
Active teacher refinement enhances the extraction of actionable signals from textual feedback.
VPD demonstrates robustness in mathematical reasoning and cold-start scenarios.
Abstract
Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target…
