Learning One Representation to Optimize All Rewards
Ahmed Touati, Yann Ollivier

TL;DR
This paper introduces the forward-backward (FB) representation of a reward-free Markov decision process. After an unsupervised learning phase, it yields near-optimal policies for any reward specified afterwards, with no planning at test time.
Contribution
It presents a novel forward-backward (FB) representation that learns long-range relationships between states and actions and allows immediate policy adaptation to arbitrary rewards without planning.
Findings
Achieves near-optimal policies for various rewards after unsupervised training.
Performs well on discrete and continuous maze tasks, pixel-based MsPacman, and robot arm control.
Provides a principled unsupervised loss with provable optimality under perfect training.
Abstract
We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process. It provides explicit near-optimal policies for any reward specified a posteriori. During an unsupervised phase, we use reward-free interactions with the environment to learn two representations via off-the-shelf deep learning methods and temporal difference (TD) learning. In the test phase, a reward representation is estimated either from observations or an explicit reward description (e.g., a target state). The optimal policy for that reward is directly obtained from these representations, with no planning. We assume access to an exploration scheme or replay buffer for the first phase. The corresponding unsupervised loss is well-principled: if training is perfect, the policies obtained are provably optimal for any reward function. With imperfect training, the sub-optimality…
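The test phase described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes the formulation, as we understand it from the paper, in which the reward representation is z = E[r(s, a) B(s, a)] over sampled state-action pairs and the policy acts greedily on F(s, a, z) · z. The names `estimate_reward_embedding`, `greedy_action`, `toy_F`, and `W` are hypothetical, and the random arrays stand in for trained networks.

```python
import numpy as np


def estimate_reward_embedding(B_values, rewards):
    """Estimate z as the sample mean of r(s, a) * B(s, a).

    B_values: (n, d) array, backward embeddings of n sampled state-actions.
    rewards:  (n,) array, rewards observed at those state-actions.
    """
    B_values = np.asarray(B_values, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return (rewards[:, None] * B_values).mean(axis=0)


def greedy_action(F, state, z, num_actions):
    """Pick argmax over a of F(state, a, z) . z, with no planning step."""
    q_values = np.array([F(state, a, z) @ z for a in range(num_actions)])
    return int(np.argmax(q_values))


# --- Toy stand-ins (assumptions, not trained models) ---
rng = np.random.default_rng(0)
d, num_actions, n = 4, 3, 5
W = rng.normal(size=(num_actions, d, d))  # hypothetical forward-model weights


def toy_F(state, action, z):
    # Hypothetical forward embedding; a real F would be a learned network.
    return W[action] @ np.tanh(state + z)


B_values = rng.normal(size=(n, d))  # stand-in for B evaluated on n samples
rewards = rng.normal(size=n)        # stand-in for observed rewards

# Reward given through observations: average r * B over the samples.
z = estimate_reward_embedding(B_values, rewards)
action = greedy_action(toy_F, np.zeros(d), z, num_actions)

# Reward given as a target state: a single sample with reward 1,
# so z reduces to B evaluated at that target.
z_goal = estimate_reward_embedding(B_values[:1], [1.0])
```

Under these assumptions, switching to a new reward only changes `z`; `F` and `B` are reused unchanged, which is what lets the policy be read off without planning.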
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning
