Learning One Representation to Optimize All Rewards

Ahmed Touati; Yann Ollivier

arXiv:2103.07945 · cs.LG · October 12, 2021 · 5 cites

TL;DR

This paper introduces the forward-backward (FB) representation of a reward-free Markov decision process. After an unsupervised learning phase, the representation yields near-optimal policies for any reward specified afterward, with no planning step.

Contribution

The paper presents the forward-backward (FB) representation, which captures long-range relationships between states and actions and gives a policy for any reward immediately, without planning.

Findings

01. Achieves near-optimal policies for a range of rewards after unsupervised training.

02. Performs well on discrete and continuous maze tasks, pixel-based MsPacman, and robot arm control.

03. Provides a principled unsupervised loss: if training is perfect, the resulting policies are provably optimal for any reward function.

Abstract

We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process. It provides explicit near-optimal policies for any reward specified a posteriori. During an unsupervised phase, we use reward-free interactions with the environment to learn two representations via off-the-shelf deep learning methods and temporal difference (TD) learning. In the test phase, a reward representation is estimated either from observations or an explicit reward description (e.g., a target state). The optimal policy for that reward is directly obtained from these representations, with no planning. We assume access to an exploration scheme or replay buffer for the first phase. The corresponding unsupervised loss is well-principled: if training is perfect, the policies obtained are provably optimal for any reward function. With imperfect training, the sub-optimality…
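
The test phase described in the abstract can be illustrated with a minimal sketch. It assumes the unsupervised phase has already produced a forward map F(s, a, z) and a backward map B(s), both valued in R^d, that a reward is summarized by a task vector z, and that the policy picks the action maximizing the inner product of F(s, a, z) with z. The function names, the array shapes, and the `forward_fn` callable are hypothetical and chosen for illustration; they are not taken from the authors' code.

```python
import numpy as np


def estimate_task_vector(backward_embeddings, rewards):
    """Estimate a task vector z from reward observations.

    Assumption: z is the sample average of r(s_i) * B(s_i) over observed states.
    backward_embeddings: array of shape (n, d), rows are B(s_i).
    rewards: array of shape (n,), entries are r(s_i).
    """
    backward_embeddings = np.asarray(backward_embeddings, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return (rewards[:, None] * backward_embeddings).mean(axis=0)


def goal_task_vector(backward_goal):
    """Task vector for a reward given as a target state.

    Assumption: the task vector is the backward embedding of the target, z = B(goal).
    """
    return np.asarray(backward_goal, dtype=float)


def greedy_action(forward_fn, state, z):
    """Pick the action maximizing F(state, a, z) . z, with no planning.

    forward_fn is a hypothetical callable returning an array of shape
    (num_actions, d) whose rows are F(state, a, z) for each action a.
    """
    q_values = forward_fn(state, z) @ z
    return int(np.argmax(q_values))
```

In use, one would call `estimate_task_vector` (or `goal_task_vector`) once per new task and then call `greedy_action` at every step. No further training or search is involved, which is the sense in which the policy is obtained "with no planning".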

Code & Models

Videos

Learning One Representation to Optimize All Rewards · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning