# Reward Prediction Error as an Exploration Objective in Deep RL

**Authors:** Riley Simmons-Edler, Ben Eisner, Daniel Yang, Anthony Bisulco, Eric Mitchell, Sebastian Seung, Daniel Lee

arXiv: 1906.08189 · 2021-01-15

## TL;DR

This paper introduces QXplore, a deep RL exploration method that maximizes reward prediction error. It targets hard exploration tasks, particularly those where state novelty does not correlate well with improved reward.

## Contribution

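The paper proposes a new exploration objective, maximizing the reward prediction error of a value function trained to predict extrinsic reward, and a deep RL algorithm, QXplore, that applies it in high-dimensional environments. QXplore outperforms a baseline state-novelty method on tasks where state novelty is not well correlated with improved reward.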

## Key findings

- QXplore performs comparably to or better than a baseline state-novelty method on all tested OpenAI Gym MuJoCo tasks and Atari games.
- QXplore outperforms the baseline on tasks where state novelty is not well correlated with improved reward.
- QXplore effectively solves hard exploration tasks in high-dimensional MDPs.

## Abstract

A major challenge in reinforcement learning is exploration, when local dithering methods such as epsilon-greedy sampling are insufficient to solve a given task. Many recent methods have proposed to intrinsically motivate an agent to seek novel states, driving the agent to discover improved reward. However, while state-novelty exploration methods are suitable for tasks where novel observations correlate well with improved reward, they may not explore more efficiently than epsilon-greedy approaches in environments where the two are not well-correlated. In this paper, we distinguish between exploration tasks in which seeking novel states aids in finding new reward, and those where it does not, such as goal-conditioned tasks and escaping local reward maxima. We propose a new exploration objective, maximizing the reward prediction error (RPE) of a value function trained to predict extrinsic reward. We then propose a deep reinforcement learning method, QXplore, which exploits the temporal difference error of a Q-function to solve hard exploration tasks in high-dimensional MDPs. We demonstrate the exploration behavior of QXplore on several OpenAI Gym MuJoCo tasks and Atari games and observe that QXplore is comparable to or better than a baseline state-novelty method in all cases, outperforming the baseline on tasks where state novelty is not well-correlated with improved reward.
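To make the objective in the abstract concrete, here is a minimal tabular sketch of using a Q-function's temporal difference error as an exploration signal. It is an illustration under stated assumptions, not the paper's implementation: the one-step TD target, the use of the absolute value as the exploration reward, and the names `td_error`, `exploration_reward`, and `q_table` are all hypothetical choices made for this example. QXplore itself is a deep RL method, and its exact formulation is given in the full paper.

```python
from typing import List

QTable = List[List[float]]  # q[state][action], a stand-in for a learned Q-function


def td_error(q: QTable, state: int, action: int, reward: float,
             next_state: int, done: bool, gamma: float = 0.99) -> float:
    """One-step TD error: r + gamma * max_a' Q(s', a') - Q(s, a)."""
    # No bootstrapping from the next state once the episode has ended.
    bootstrap = 0.0 if done else gamma * max(q[next_state])
    return reward + bootstrap - q[state][action]


def exploration_reward(q: QTable, state: int, action: int, reward: float,
                       next_state: int, done: bool, gamma: float = 0.99) -> float:
    """Assumed exploration signal: the magnitude of the TD error.

    Transitions whose extrinsic reward the Q-function predicts poorly
    receive a large signal; well-predicted transitions receive none.
    """
    return abs(td_error(q, state, action, reward, next_state, done, gamma))


# Hypothetical 3-state, 2-action table that currently predicts zero reward everywhere.
q_table: QTable = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

# An unexpected extrinsic reward of 1.0 produces a large prediction error.
surprise = exploration_reward(q_table, state=0, action=1, reward=1.0,
                              next_state=2, done=True)

# A transition the table already predicts correctly produces no signal.
familiar = exploration_reward(q_table, state=0, action=0, reward=0.0,
                              next_state=1, done=False)
```

In this sketch the signal depends only on how well extrinsic reward is predicted, not on whether the state has been visited before, which is the distinction the abstract draws against state-novelty objectives.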

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.08189/full.md

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/1906.08189/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/1906.08189/full.md

---
Source: https://tomesphere.com/paper/1906.08189