TL;DR
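Proximal Policy Distillation (PPD) combines student-driven distillation with Proximal Policy Optimization (PPO). Compared with the student-distill and teacher-distill baselines, it improves sample efficiency and student policy quality on Atari, MuJoCo, and Procgen, and it is more robust to imperfect teachers.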
Contribution
The paper introduces PPD, a policy distillation method that trains the student with PPO during distillation, so the student learns from the rewards it collects as well as from the teacher's policy.
Findings
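PPD is more sample-efficient than the student-distill and teacher-distill baselines.
PPD produces higher-quality student policies across Atari, MuJoCo, and Procgen, for students that are smaller than, identical to, or larger than the teacher network.
PPD is more robust when distilling from imperfect demonstrations.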
Abstract
We introduce Proximal Policy Distillation (PPD), a novel policy distillation method that integrates student-driven distillation and Proximal Policy Optimization (PPO) to increase sample efficiency and to leverage the additional rewards that the student policy collects during distillation. To assess the efficacy of our method, we compare PPD with two common alternatives, student-distill and teacher-distill, over a wide range of reinforcement learning environments that include discrete actions and continuous control (Atari, MuJoCo, and Procgen). For each environment and method, we perform distillation to a set of target student neural networks that are smaller, identical (self-distillation), or larger than the teacher network. Our findings indicate that PPD improves sample efficiency and produces better student policies compared to typical policy distillation approaches. Moreover, PPD…
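The abstract does not state the PPD objective, so the following is only a minimal sketch of one way a PPO update could be combined with a distillation term. It assumes discrete actions, trajectories collected by the student, a standard PPO clipped surrogate, and a KL(teacher || student) penalty. The function name `ppd_style_loss` and the parameters `clip_eps` and `distill_coef` are hypothetical and are not taken from the paper, whose exact loss may differ.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def ppd_style_loss(student_logits, old_student_logits, teacher_logits,
                   actions, advantages, clip_eps=0.2, distill_coef=1.0):
    """Illustrative loss: PPO clipped surrogate + KL(teacher || student).

    This is an assumed form, not the paper's confirmed objective.
    All logits have shape (batch, num_actions); actions and advantages
    have shape (batch,). States are assumed to come from student rollouts.
    """
    logp_new = log_softmax(student_logits)
    logp_old = log_softmax(old_student_logits)
    logp_teacher = log_softmax(teacher_logits)

    idx = np.arange(len(actions))
    # Probability ratio between the updated and the rollout-time student.
    ratio = np.exp(logp_new[idx, actions] - logp_old[idx, actions])

    # Standard PPO clipped surrogate (negated, since we minimize).
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    ppo_loss = -np.mean(np.minimum(unclipped, clipped))

    # Distillation term: per-state KL from teacher to student, averaged.
    kl = np.sum(np.exp(logp_teacher) * (logp_teacher - logp_new), axis=-1)
    distill_loss = np.mean(kl)

    return ppo_loss + distill_coef * distill_loss, ppo_loss, distill_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, num_actions = 4, 3
    student = rng.normal(size=(batch, num_actions))
    teacher = rng.normal(size=(batch, num_actions))
    acts = rng.integers(0, num_actions, size=batch)
    adv = rng.normal(size=batch)
    total, ppo, kl = ppd_style_loss(student, student, teacher, acts, adv)
    print(f"total={total:.4f} ppo={ppo:.4f} kl={kl:.4f}")
```

In this sketch, setting `distill_coef` to zero recovers plain PPO on the student's own rewards, while dropping the PPO term leaves a student-distill style objective on student-collected states. This is one way to read the abstract's claim that PPD uses both signals.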
