Offline Behavior Distillation
Shiye Lei, Sen Zhang, Dacheng Tao

TL;DR
This paper introduces Offline Behavior Distillation (OBD), a method to efficiently synthesize expert-like data from sub-optimal offline RL data, enabling faster policy learning with theoretical guarantees and improved empirical performance.
Contribution
The paper proposes Av-PBC, an improved OBD objective with linear discount complexity, and provides theoretical analysis and extensive experiments demonstrating its effectiveness.
Findings
Av-PBC achieves better distillation performance than the naive OBD objectives, DBC and PBC.
It also converges faster than these naive objectives.
It generalizes well across architectures and optimizers.
Abstract
Massive reinforcement learning (RL) data are typically collected to train policies offline without the need for interactions, but the large data volume can cause training inefficiencies. To tackle this issue, we formulate offline behavior distillation (OBD), which synthesizes limited expert behavioral data from sub-optimal RL data, enabling rapid policy learning. We propose two naive OBD objectives, DBC and PBC, which measure distillation performance via the decision difference between policies trained on distilled data and either offline data or a near-expert policy. Due to intractable bi-level optimization, the OBD objective is difficult to minimize to small values, which weakens PBC because its distillation performance guarantee has quadratic discount complexity. We theoretically establish the equivalence between the policy performance and action-value…
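To make the "decision difference" idea concrete, here is a minimal sketch rather than the authors' implementation. It assumes continuous actions, measures the difference as a mean squared action gap on shared states, and reads "linear" and "quadratic" discount complexity as the factors 1/(1-gamma) and 1/(1-gamma)^2. The function names, the linear policies, and the random states are all hypothetical and chosen only for illustration.

```python
import numpy as np


def decision_difference(policy_a, policy_b, states):
    """Mean squared gap between the actions two policies pick on the same states.

    Assumption: actions are continuous vectors and the gap is squared L2 distance.
    """
    actions_a = policy_a(states)
    actions_b = policy_b(states)
    return float(np.mean(np.sum((actions_a - actions_b) ** 2, axis=-1)))


def discount_factors(gamma):
    """Return the horizon factors 1/(1-gamma) and 1/(1-gamma)^2.

    Assumption: these are the scalings meant by linear and quadratic
    discount complexity in a performance bound.
    """
    horizon = 1.0 / (1.0 - gamma)
    return horizon, horizon ** 2


# Hypothetical setup: two linear policies over 4-dimensional states.
rng = np.random.default_rng(0)
states = rng.normal(size=(256, 4))
W_ref = rng.normal(size=(4, 2))   # stands in for a near-expert policy
W_distilled = W_ref + 0.1         # stands in for a policy trained on distilled data


def reference_policy(s):
    return s @ W_ref


def distilled_policy(s):
    return s @ W_distilled


gap = decision_difference(distilled_policy, reference_policy, states)
linear, quadratic = discount_factors(0.99)
print(f"decision difference: {gap:.4f}")
print(f"linear factor: {linear:.0f}, quadratic factor: {quadratic:.0f}")
```

Under these assumptions, with gamma = 0.99 the same residual decision difference is amplified by a factor of about 100 in a linear bound and about 10,000 in a quadratic one, which is why an objective that cannot be driven close to zero is hurt far more by a quadratic guarantee.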
