Behaviour Distillation
Andrei Lupu, Chris Lu, Jarek Liesen, Robert Tjarko Lange, Jakob Foerster

TL;DR
This paper introduces behaviour distillation, a setting in which the information needed to train a reinforcement learning policy is condensed into a small synthetic dataset, with applications to efficient training, generalization, and interpretability across various tasks.
Contribution
It formalizes behaviour distillation for RL and proposes HaDES, a method that discovers small synthetic datasets of state-action pairs, without access to expert data, on which competitive policies can be trained with supervised learning.
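One way to write this setting down is as a bi-level problem. The notation below is an assumption made for illustration and is not taken from the paper: $\mathcal{D}_\phi$ is the synthetic dataset, $\pi_\theta$ the policy, $\ell$ a supervised loss, and $J$ the expected discounted return.

```latex
\max_{\mathcal{D}_\phi} \; J\big(\theta^{*}(\mathcal{D}_\phi)\big)
\quad \text{s.t.} \quad
\theta^{*}(\mathcal{D}_\phi) = \arg\min_{\theta} \sum_{(s_i, a_i) \in \mathcal{D}_\phi} \ell\big(\pi_\theta(s_i), a_i\big),
\qquad
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \gamma^{t} r_t\Big]
```

The inner problem is ordinary supervised learning on the state-action pairs; the outer problem scores the dataset only by the return of the policy it produces, so no expert data enters either level.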
Findings
HaDES discovers datasets of four state-action pairs that train agents effectively.
Synthetic datasets generalize across different architectures and hyperparameters.
The distilled datasets can also be applied to zero-shot multi-task training, and the method improves neuroevolution results.
Abstract
Dataset distillation aims to condense large datasets into a small number of synthetic examples that can be used as drop-in replacements when training new models. It has applications to interpretability, neural architecture search, privacy, and continual learning. Despite strong successes in supervised domains, such methods have not yet been extended to reinforcement learning, where the lack of a fixed dataset renders most distillation methods unusable. Filling the gap, we formalize behaviour distillation, a setting that aims to discover and then condense the information required for training an expert policy into a synthetic dataset of state-action pairs, without access to expert data. We then introduce Hallucinating Datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of just four state-action pairs which, under supervised learning,…
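The loop the abstract describes (evolve a tiny set of state-action pairs, train a policy on them with supervised learning, score the result by return) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the one-step toy task, the linear softmax policy, the antithetic evolution-strategies update, and every name and hyperparameter (`policy_return`, `train_on_dataset`, `distill`, `W_TRUE`, `sigma`, `lr`) are assumptions made for the example.

```python
import numpy as np

# Hypothetical toy task: action 1 is rewarded when W_TRUE . s > 0, else action 0.
W_TRUE = np.array([1.0, -1.0])


def policy_return(W, states):
    """Toy stand-in for an RL rollout: fraction of states where the greedy action is rewarded."""
    actions = np.argmax(states @ W, axis=1)
    rewarded = (states @ W_TRUE > 0).astype(int)
    return float(np.mean(actions == rewarded))


def train_on_dataset(syn_states, syn_actions, steps=20, lr=0.5):
    """Inner loop: behaviour cloning of a linear softmax policy from a fixed zero init."""
    W = np.zeros((syn_states.shape[1], 2))
    onehot = np.eye(2)[syn_actions]
    for _ in range(steps):
        logits = syn_states @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy loss with respect to W.
        W -= lr * syn_states.T @ (probs - onehot) / len(syn_states)
    return W


def fitness(phi, syn_actions, eval_states):
    """Score a candidate dataset by the return of the policy trained on it."""
    W = train_on_dataset(phi.reshape(-1, 2), syn_actions)
    return policy_return(W, eval_states)


def distill(generations=50, pop_pairs=16, sigma=0.2, lr=0.3, seed=0):
    """Outer loop: antithetic evolution strategies over four synthetic states."""
    rng = np.random.default_rng(seed)
    eval_states = rng.normal(size=(256, 2))  # fixed evaluation states
    syn_actions = np.array([0, 1, 0, 1])     # action labels kept fixed (assumption)
    phi = 0.1 * rng.normal(size=8)           # 4 states x 2 dims, flattened
    init_fit = fitness(phi, syn_actions, eval_states)
    best_phi, best_fit = phi.copy(), init_fit
    for _ in range(generations):
        eps = rng.normal(size=(pop_pairs, phi.size))
        f_pos = np.array([fitness(phi + sigma * e, syn_actions, eval_states) for e in eps])
        f_neg = np.array([fitness(phi - sigma * e, syn_actions, eval_states) for e in eps])
        # Antithetic ES estimate of the gradient of expected fitness.
        grad = ((f_pos - f_neg)[:, None] * eps).mean(axis=0) / (2 * sigma)
        phi = phi + lr * grad
        fit = fitness(phi, syn_actions, eval_states)
        if fit > best_fit:  # keep the best dataset seen so far
            best_phi, best_fit = phi.copy(), fit
    return best_phi.reshape(4, 2), syn_actions, init_fit, best_fit


if __name__ == "__main__":
    states, actions, init_fit, best_fit = distill()
    print("synthetic states:\n", states)
    print("actions:", actions)
    print("return before / after distillation:", init_fit, best_fit)
```

The outer loop never differentiates through the inner training or the environment; it only needs a scalar return per candidate dataset, which is why a black-box optimizer such as evolution strategies fits this setting. In a real RL task `policy_return` would be replaced by environment rollouts.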
Peer Reviews
Decision: ICLR 2024 poster
+ The introduction of behaviour distillation can enrich the literature and direction of Dataset Distillation (DD). Different from standard DD, behaviour distillation does not require access to expert datasets. This is quite close to a standard/basic RL setting and could be a good starting point.
+ The authors' writing is easy to follow and clear on the technical details.
+ The experimental section shows promising results using the proposed algorithm, HaDES.
- Although it is interesting, the proposed behaviour distillation does not seem to have a clear motivation for why it can be useful (what is the motivation for proposing this problem, and what are its potential applications, beyond the fact that DD has not been applied to RL), or for why the problem is not formulated directly on an expert dataset. Distilling from scratch can make the problem a lot harder.
- It would be great if the authors could discuss the behaviour difference of a standard RL algorithm and a synthetic data-
This paper is the first to introduce dataset distillation into the RL setting. Its core strengths are as follows:
1. It formulates behaviour distillation by combining dataset distillation with an RL reward function.
2. It proposes a novel behaviour distillation algorithm, HaDES, that learns a few (state, action) pairs to quickly train a policy network with supervised learning.
3. Multiple empirical studies are conducted to verify the effectiveness and robustness of the synthet
While the authors creatively incorporate dataset distillation into the RL setting, there are some weaknesses, mainly in the motivation and experiments, which hurt the contribution of this manuscript.
**Q1:** Sec. 1 states that this work is "motivated by the challenge of behaviour distillation", but this challenge is not a clear motivation for developing behaviour distillation. What are the advantages of using the distilled dataset in RL besides fast training? In my opinion, the networ
- Transfers the concept of dataset distillation to reinforcement learning
- The condensed dataset can be quite small
- The results on supervised classification dataset distillation also seem good
- 1) I think the neuroevolution part is quite distinct from the dataset distillation part. It should be explained clearly why a neuroevolution technique is needed instead of a more classical technique.
- 2) The naming may be a bit confusing: if the proposed method is meant only to distill the dataset (and not to train on it later), then it would be clearer to name the experiments Method + HaDES (whenever a method is trained on top of the hallucinated dataset).
- 3) If I understood correctly, on
Taxonomy
Topics: Process Optimization and Integration
