Information asymmetry in KL-regularized RL
Alexandre Galashov, Siddhant M. Jayakumar, Leonard Hasenclever, Dhruva Tirumala, Jonathan Schwarz, Guillaume Desjardins, Wojciech M. Czarnecki, Yee Whye Teh, Razvan Pascanu, Nicolas Heess

TL;DR
This paper studies KL-regularized reinforcement learning in which the default policy is learned from data but given restricted information, so that it captures reusable behaviors from structure repeated across the environment and speeds up learning.
Contribution
It introduces a method to learn a default policy constrained by information limits within KL-regularized RL, connecting it to information bottleneck and variational EM frameworks.
Findings
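For certain tasks, learning a default policy alongside the policy significantly speeds up training.
Restricting the information the default policy receives forces it to learn reusable behaviors.
Empirical results cover both discrete and continuous action domains.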
Abstract
Many real world tasks exhibit rich structure that is repeated across different parts of the state space or in time. In this work we study the possibility of leveraging such repeated structure to speed up and regularize learning. We start from the KL regularized expected reward objective which introduces an additional component, a default policy. Instead of relying on a fixed default policy, we learn it from data. But crucially, we restrict the amount of information the default policy receives, forcing it to learn reusable behaviors that help the policy learn faster. We formalize this strategy and discuss connections to information bottleneck approaches and to the variational EM algorithm. We present empirical results in both discrete and continuous action domains and demonstrate that, for certain tasks, learning a default policy alongside the policy can significantly speed up and…
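As a rough illustration of the objective the abstract refers to, the sketch below scores a single trajectory under a KL-regularized return: each step's reward is reduced by a penalty proportional to the KL divergence between the policy and the default policy. This is a minimal sketch, not the authors' implementation. The function names, the categorical (discrete-action) setting, and the `alpha` and `gamma` values are assumptions made for the example. In the paper's setting the default policy's distributions would come from a model that sees only part of the agent's information.

```python
import numpy as np

def kl_categorical(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same actions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps guards against log(0); it is an illustrative choice.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def kl_regularized_return(rewards, policy_probs, default_probs,
                          alpha=0.1, gamma=0.99):
    """Discounted sum of (reward - alpha * KL(policy || default)) per step.

    rewards:       sequence of scalar rewards, one per time step
    policy_probs:  per-step action distributions of the policy
    default_probs: per-step action distributions of the default policy
    alpha, gamma:  hypothetical regularization weight and discount factor
    """
    total = 0.0
    for t, (r, p, q) in enumerate(zip(rewards, policy_probs, default_probs)):
        total += gamma ** t * (r - alpha * kl_categorical(p, q))
    return total

# When the policy matches the default policy, the penalty vanishes.
uniform = [0.5, 0.5]
same = kl_regularized_return([1.0, 1.0], [uniform, uniform],
                             [uniform, uniform], alpha=0.1, gamma=1.0)

# Deviating from the default policy lowers the regularized return.
peaked = [0.9, 0.1]
diff = kl_regularized_return([1.0, 1.0], [peaked, peaked],
                             [uniform, uniform], alpha=0.1, gamma=1.0)
```

Here the default policy is fixed to a uniform distribution to keep the example small. Learning it from data under an information restriction, as the abstract describes, would replace `default_probs` with the output of a separate model.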
Taxonomy
Topics: Neural Networks and Applications · Reinforcement Learning in Robotics · Machine Learning and Algorithms
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
