Observe and Look Further: Achieving Consistent Performance on Atari

Tobias Pohlen; Bilal Piot; Todd Hester; Mohammad Gheshlaghi Azar; Dan Horgan; David Budden; Gabriel Barth-Maron; Hado van Hasselt; John Quan; Mel Večerík; Matteo Hessel; Rémi Munos; Olivier Pietquin

arXiv:1805.11593 · cs.LG · May 30, 2018 · 85 cites

Open Access

TL;DR

This paper introduces a deep RL algorithm that addresses three challenges (processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently), reaching human-level performance on nearly all Atari games and completing the first level of Montezuma's Revenge.

Contribution

The paper presents an algorithm that combines a transformed Bellman operator, an auxiliary temporal consistency loss for stable training, and human demonstrations to ease exploration in Atari games.
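As a rough illustration of what a transformed Bellman operator does, the sketch below squashes the bootstrapped target through an invertible function `h`, so that games with very large and very small rewards produce targets on a comparable scale. The specific form of `h`, its closed-form inverse, and the value `EPS = 1e-2` are not given on this page; they are assumptions made for the example, as are the helper names `h`, `h_inv`, and `transformed_target`.

```python
import math

EPS = 1e-2  # assumed regularisation constant; not stated on this page


def h(z, eps=EPS):
    """Assumed squashing function: sign(z) * (sqrt(|z| + 1) - 1) + eps * z."""
    return math.copysign(math.sqrt(abs(z) + 1.0) - 1.0, z) + eps * z


def h_inv(z, eps=EPS):
    """Closed-form inverse of h, obtained by solving a quadratic in sqrt(|x| + 1)."""
    inner = (math.sqrt(1.0 + 4.0 * eps * (abs(z) + 1.0 + eps)) - 1.0) / (2.0 * eps)
    return math.copysign(inner * inner - 1.0, z)


def transformed_target(reward, gamma, next_q_max, eps=EPS):
    """One-step target in squashed space: h(r + gamma * h_inv(max_a Q(s', a)))."""
    return h(reward + gamma * h_inv(next_q_max, eps), eps)


if __name__ == "__main__":
    # Rewards that differ by orders of magnitude map to targets of similar scale.
    for r in (1.0, 100.0, 10000.0):
        print(r, transformed_target(r, gamma=0.999, next_q_max=0.0))
```

The network then learns values in the squashed space, and `h_inv` maps them back to the raw return scale whenever a bootstrapped value is needed. This avoids clipping rewards, which would discard the difference between small and large rewards.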

Findings

01. Exceeds average human performance on 40 out of 42 Atari games.

02. First deep RL algorithm to solve the first level of Montezuma's Revenge.

03. Effectively handles diverse reward scales and long-term planning.

Abstract

Despite significant advances in the field of deep Reinforcement Learning (RL), today's algorithms still fail to learn human-level policies consistently over a set of diverse tasks such as Atari 2600 games. We identify three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently. In this paper, we propose an algorithm that addresses each of these challenges and is able to learn human-level policies on nearly all Atari games. A new transformed Bellman operator allows our algorithm to process rewards of varying densities and scales; an auxiliary temporal consistency loss allows us to train stably using a discount factor of $\gamma = 0.999$ (instead of $\gamma = 0.99$), extending the effective planning horizon by an order of magnitude; and we ease the…
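The "order of magnitude" claim about the planning horizon can be checked with a small calculation. The sketch below uses the common rule of thumb that a discount factor γ corresponds to an effective horizon of about 1 / (1 − γ) steps; that rule and the helper name `effective_horizon` are assumptions for illustration and are not taken from the abstract.

```python
def effective_horizon(gamma):
    """Rule-of-thumb horizon: the sum of the weights 1 + gamma + gamma^2 + ..."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    short = effective_horizon(0.99)   # roughly 100 steps
    long = effective_horizon(0.999)   # roughly 1000 steps
    print(short, long, long / short)
```

Under this rule of thumb, moving from γ = 0.99 to γ = 0.999 stretches the horizon from about 100 steps to about 1000 steps, a factor of about ten.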

Taxonomy

Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Artificial Intelligence in Games