Distilling Reinforcement Learning into Single-Batch Datasets

Connor Wilhelm; Dan Ventura

arXiv:2508.09283·cs.LG·August 14, 2025

3 Reviews

TL;DR

This paper introduces a method to compress reinforcement learning environments into small, synthetic supervised datasets, enabling rapid training and cross-modality learning from RL to supervised learning.

Contribution

It extends dataset distillation to reinforcement learning, transforming RL tasks into one-batch supervised datasets and demonstrating its generalizability across environments and architectures.

Findings

01. RL environments can be distilled into single-batch datasets.

02. Distilled datasets enable one-step training that approximates the original RL performance.

03. The method generalizes across different RL environments and learner architectures.

Abstract

Dataset distillation compresses a large dataset into a small synthetic dataset such that learning on the synthetic dataset approximates learning on the original. Training on the distilled dataset can be performed in as little as one step of gradient descent. We demonstrate that distillation is generalizable to different tasks by distilling reinforcement learning environments into one-batch supervised learning datasets. This demonstrates not only distillation's ability to compress a reinforcement learning task but also its ability to transform one learning modality (reinforcement learning) into another (supervised learning). We present a novel extension of proximal policy optimization for meta-learning and use it in distillation of a multi-dimensional extension of the classic cart-pole problem, all MuJoCo environments, and several Atari games. We demonstrate distillation's ability to…
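As a rough illustration of the "one step of gradient descent" idea in the abstract, the sketch below trains a linear softmax policy with a single supervised gradient step on one small batch. Everything in it is an assumption made for illustration: the names `X_syn`, `Y_syn`, and `one_step_train`, the linear learner, the dimensions, and the learning rate are hypothetical, and the batch is random placeholder data. In the paper the synthetic batch is meta-learned with an extension of PPO; that outer loop is not shown here.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, b, X, Y):
    # Mean cross-entropy between soft action labels Y and the policy's output.
    P = softmax(X @ W + b)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def one_step_train(W, b, X, Y, lr):
    # A single supervised gradient-descent step on the whole batch.
    P = softmax(X @ W + b)
    G = (P - Y) / X.shape[0]  # gradient of the mean cross-entropy w.r.t. logits
    return W - lr * (X.T @ G), b - lr * G.sum(axis=0)

rng = np.random.default_rng(0)
obs_dim, n_actions, batch = 4, 2, 8  # hypothetical sizes (cart-pole-like)

# Placeholder "distilled" batch: synthetic observations with soft action labels.
# A real distiller would learn these values; here they are random.
X_syn = rng.normal(size=(batch, obs_dim))
Y_syn = softmax(rng.normal(size=(batch, n_actions)))
lr = 0.1  # assumed; a distiller could also learn the step size

# Freshly initialized learner, then exactly one training step.
W0 = rng.normal(scale=0.1, size=(obs_dim, n_actions))
b0 = np.zeros(n_actions)
loss_before = cross_entropy(W0, b0, X_syn, Y_syn)
W1, b1 = one_step_train(W0, b0, X_syn, Y_syn, lr)
loss_after = cross_entropy(W1, b1, X_syn, Y_syn)

print(f"loss before: {loss_before:.4f}, after one step: {loss_after:.4f}")
```

The learner never interacts with an environment in this step; it only fits the labels in the batch, which is the sense in which the RL task has been turned into a supervised one. With random placeholder data the resulting policy is of course meaningless; only the training mechanics are shown.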

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 2

Strengths

Distillation seems like an interesting technique to reduce the data requirement of reinforcement learning.

Weaknesses

I vote to reject primarily because the motivation for the algorithm and its empirical evaluation are difficult to follow. I had a difficult time understanding the core takeaways of this paper.

1. Many of the contributions listed can be combined. For instance, contributions 1, 3, 5 and 6 are essentially saying the same thing: this work proposes a new distillation technique and demonstrates its effectiveness empirically.
2. Contribution 2 doesn't seem like a contribution; it's simply a task tha

Reviewer 02 · Rating 8 · Confidence 4

Strengths

Overall, the paper is well written. It provides a clearly defined algorithm with a training graph and pseudocode. It provides a simple algorithm based on PPO to distill RL environments into a parameterized distiller. The performance results, from an easy task to complex tasks, are on par with direct RL training, which demonstrates the generalizability and high distillation performance of the algorithm.

Weaknesses

* The experiments do not cover continuous control problems, which are also an important part of RL environments. Demonstrating distillation on those tasks would greatly benefit the community, as many robot experiments use a continuous action space.
* If I understand correctly, the final baseline RL agent is determined by the time limit and convergence. But I would imagine using the same number of training samples as in the distillation's outer loop for a fairer comparison.
* The cost saving

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* The work is novel, considering it is probably the first to introduce dataset distillation for online RL. However, a related work on dataset distillation for offline RL is missing from the related work section, and I believe it should be discussed (https://arxiv.org/abs/2407.20299).
* Results are promising in both the cart-pole and Atari experiments.
* The framework does not introduce significant overhead to the wall-clock time of PPO.

Weaknesses

* One of the main issues of this paper is motivation. Dataset distillation is proposed so that a large dataset can be condensed into a smaller synthetic one; however, I don't think the same analogy is reasonable for RL. Firstly, when you train a supervised model with large datasets, you would get almost the same accuracy, which means that this dataset would have a score equivalence given a model, whereas this is not accurate for RL environments because RL environments are not deterministic. So, R
