Offline Behavior Distillation

Shiye Lei; Sen Zhang; Dacheng Tao

arXiv:2410.22728 · cs.LG · October 31, 2024

TL;DR

This paper introduces Offline Behavior Distillation (OBD), a method to efficiently synthesize expert-like data from sub-optimal offline RL data, enabling faster policy learning with theoretical guarantees and improved empirical performance.

Contribution

The paper proposes Av-PBC, an improved OBD objective with linear discount complexity, and provides theoretical analysis and extensive experiments demonstrating its effectiveness.
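To make the gap between linear and quadratic discount complexity concrete, the following sketch evaluates the two factors, 1/(1 - γ) and 1/(1 - γ)^2, for a few discount values. It is plain arithmetic, not code from the paper; the function names are hypothetical, and constants hidden by the O(·) notation are ignored.

```python
# Illustrative arithmetic only: how the linear and quadratic discount
# factors grow as gamma approaches 1. Function names are hypothetical.

def linear_factor(gamma: float) -> float:
    """1 / (1 - gamma): the linear discount factor."""
    return 1.0 / (1.0 - gamma)


def quadratic_factor(gamma: float) -> float:
    """1 / (1 - gamma)^2: the quadratic discount factor."""
    return 1.0 / (1.0 - gamma) ** 2


for gamma in (0.9, 0.99, 0.999):
    print(
        f"gamma={gamma}: linear ~ {linear_factor(gamma):.0f}, "
        f"quadratic ~ {quadratic_factor(gamma):.0f}"
    )
```

At γ = 0.99, a common choice in RL, the quadratic factor is roughly 100 times larger than the linear one, so a bound that scales linearly is far less sensitive to a non-zero residual in the objective.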

Findings

01 Av-PBC achieves superior distillation performance.

02 Av-PBC converges faster than the naive objectives (DBC and PBC).

03 Av-PBC generalizes well across architectures and optimizers.

Abstract

Massive reinforcement learning (RL) data are typically collected to train policies offline without the need for interactions, but the large data volume can cause training inefficiencies. To tackle this issue, we formulate offline behavior distillation (OBD), which synthesizes limited expert behavioral data from sub-optimal RL data, enabling rapid policy learning. We propose two naive OBD objectives, DBC and PBC, which measure distillation performance via the decision difference between policies trained on distilled data and either offline data or a near-expert policy. Due to intractable bi-level optimization, the OBD objective is difficult to minimize to small values, which deteriorates PBC by its distillation performance guarantee with quadratic discount complexity $O(1/(1-\gamma)^{2})$. We theoretically establish the equivalence between the policy performance and action-value…
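The two naive objectives described in the abstract can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: states and actions are scalars, the policy is linear (a = w * s) and trained by least-squares behavior cloning, the "near-expert policy" is a hand-written function, and the distilled set is hand-picked rather than optimized. The reading that DBC compares against the logged offline actions and PBC against the near-expert policy follows the order in which the abstract lists them and is itself an assumption. All names are hypothetical.

```python
import random


def fit_linear_policy(data):
    """Behavior cloning for a 1-D linear policy a = w * s (closed-form least squares)."""
    num = sum(s * a for s, a in data)
    den = sum(s * s for s, _ in data)
    return num / den


def dbc_objective(w, offline_data):
    """Decision difference against the actions logged in the offline data."""
    return sum((w * s - a) ** 2 for s, a in offline_data) / len(offline_data)


def pbc_objective(w, reference_policy, states):
    """Decision difference against a near-expert policy on the offline states."""
    return sum((w * s - reference_policy(s)) ** 2 for s in states) / len(states)


def near_expert(s):
    """Hypothetical near-expert policy, used only for this illustration."""
    return 2.0 * s


random.seed(0)
offline_states = [random.uniform(-1.0, 1.0) for _ in range(1000)]
# Sub-optimal offline data: biased and noisy actions.
offline_data = [(s, 1.5 * s + random.gauss(0.0, 0.3)) for s in offline_states]

# A tiny distilled dataset. It is hand-picked here; in OBD it would be the
# variable being optimized in the outer loop of a bi-level problem.
distilled = [(1.0, 2.0), (-1.0, -2.0)]

w_distilled = fit_linear_policy(distilled)  # inner loop: train on distilled data
dbc = dbc_objective(w_distilled, offline_data)
pbc = pbc_objective(w_distilled, near_expert, offline_states)
print(f"w_distilled={w_distilled}, DBC={dbc:.4f}, PBC={pbc:.4f}")
```

In this toy setting two synthetic pairs are enough to recover the near-expert policy, so the PBC value is zero while the DBC value stays positive because the logged actions are sub-optimal. In realistic settings the inner training loop has no closed form, which is where the bi-level optimization difficulty mentioned in the abstract arises.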

Code & Models

Repositories

leaveslei/obd (PyTorch, official)
