Data Deletion Can Help in Adaptive RL
Param Budhraja, Aditya Gangrade, Alex Olshevsky, Venkatesh Saligrama

TL;DR
This paper demonstrates that random data deletion during training improves the robustness of reinforcement learning policies in time-varying environments by implicitly managing data distribution mismatch.
Contribution
It introduces a simple data deletion trick that enhances estimator robustness and provides theoretical analysis of when deletion is beneficial under distribution mismatch.
Findings
Data deletion reduces the robustness gap by 30% for MLPs.
Deletion allows smaller models to outperform larger ones trained without deletion.
Theoretical analysis shows deletion helps when the distribution mismatch and SNR are sufficiently low.
Abstract
Deploying reinforcement learning policies in the real world requires adapting to time-varying environments. We study this problem in the contextual Markov Decision Process (cMDP) framework, where a family of environments is indexed by a low-dimensional context unknown at test time. The standard approach decomposes the problem: train a so-called "universal policy" which assumes knowledge of the true context, then pair it with a context estimator which approximates context using the observed trajectory. We identify a simple, counterintuitive trick that substantially improves the estimator: randomly delete a fraction of the training buffer after each round. This works because data is collected across multiple rounds using progressively better policies, and older trajectories come from a different distribution than what the estimator will face at deployment time; random deletion creates an…
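The deletion step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `delete_random_fraction` and `simulate_rounds`, the `(round_idx, i)` placeholder samples, and the ordering (collect, train, then delete) are assumptions made for this example. It only tracks how many samples from each collection round remain in the buffer, to show that uniform random deletion after every round leaves older rounds with progressively fewer samples (in expectation, a sample that has passed through k deletions survives with probability (1 - fraction)^k).

```python
import random


def delete_random_fraction(buffer, fraction, rng):
    """Return a new list with a uniformly random `fraction` of items removed."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    n_delete = int(round(fraction * len(buffer)))
    n_keep = len(buffer) - n_delete
    # rng.sample draws n_keep distinct items without replacement.
    return rng.sample(buffer, n_keep)


def simulate_rounds(num_rounds, samples_per_round, fraction, seed=0):
    """Count surviving samples per collection round (hypothetical setup)."""
    rng = random.Random(seed)
    buffer = []
    for round_idx in range(num_rounds):
        # Placeholder for trajectories collected with the current policy;
        # each sample is tagged with the round that produced it.
        buffer.extend((round_idx, i) for i in range(samples_per_round))
        # (The context estimator would be trained on `buffer` here.)
        # Randomly delete a fraction of the buffer after the round.
        buffer = delete_random_fraction(buffer, fraction, rng)
    counts = [0] * num_rounds
    for round_idx, _ in buffer:
        counts[round_idx] += 1
    return counts


if __name__ == "__main__":
    # Per-round survivor counts after 5 rounds with half the buffer deleted each time.
    print(simulate_rounds(num_rounds=5, samples_per_round=1000, fraction=0.5))
```

Running the script prints one count per round; with a deletion fraction of 0.5 the most recent round dominates the buffer, while the earliest rounds contribute only a small share.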
