# In Hindsight: A Smooth Reward for Steady Exploration

**Authors:** Hadi S. Jomaa, Josif Grabocka, Lars Schmidt-Thieme

arXiv: 1906.09781 · 2019-06-25

## TL;DR

This paper introduces a hindsight factor in Q-learning that incorporates historical temporal differences, reducing overestimation errors and improving stability and performance in deterministic and ATARI game environments.

## Contribution

The paper proposes a novel hindsight factor that enhances Q-learning by integrating past temporal differences, leading to lower overestimation and better stability.

## Key findings

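- Reduces overestimation errors in Q-learning, as reflected in lower estimated action values.
- Outperforms DQN, DDQN, and dueling networks on a variety of ATARI games, measured as the mean score over 100 episodes after training for 10 million frames.
- Improves stability in a deterministic, continuous-state function estimation problem.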

## Abstract

In classical Q-learning, the objective is to maximize the sum of discounted rewards by iteratively using the Bellman equation as an update, in an attempt to estimate the action value function of the optimal policy. Conventionally, the loss function is defined as the temporal difference between the action value and the expected (discounted) reward; however, it focuses solely on the future, leading to overestimation errors. We extend the well-established Q-learning techniques by introducing the hindsight factor, an additional loss term that takes into account how the model progresses, by integrating the historic temporal difference as part of the reward. The effect of this modification is examined in a deterministic continuous-state space function estimation problem, where the overestimation phenomenon is significantly reduced, resulting in improved stability. The underlying effect of the hindsight factor is modeled as an adaptive learning rate, which, unlike existing adaptive optimizers, takes into account the previously estimated action value. The proposed method outperforms variations of Q-learning, with an overall higher average reward and lower action values, which supports the deterministic evaluation and proves that the hindsight factor contributes to lower overestimation errors. The mean average score of 100 episodes obtained after training for 10 million frames shows that the hindsight factor outperforms deep Q-networks, double deep Q-networks and dueling networks for a variety of ATARI games.
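To make the idea in the abstract concrete, the sketch below contrasts the standard one-step temporal-difference loss with a loss that adds a penalty tied to a previously estimated action value. This summary does not give the paper's exact formulation, so the form of the extra term, the weight `delta`, and the names `td_loss`, `hindsight_loss`, and `q_sa_old` are all assumptions made for illustration, not the authors' definition.

```python
def td_loss(q_sa, reward, next_q_values, gamma=0.99, done=False):
    """Squared one-step temporal difference for a single transition.

    q_sa          -- current estimate Q(s, a)
    reward        -- immediate reward r
    next_q_values -- estimates Q(s', a') for every action a' in the next state
    """
    # Standard Q-learning target: r + gamma * max_a' Q(s', a'),
    # with no bootstrapping on terminal transitions.
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    target = reward + bootstrap
    return (target - q_sa) ** 2


def hindsight_loss(q_sa, q_sa_old, reward, next_q_values,
                   gamma=0.99, delta=0.5, done=False):
    """TD loss plus an ASSUMED hindsight-style penalty.

    q_sa_old is a previously estimated value of Q(s, a), for example from an
    older copy of the network. The quadratic penalty and its weight `delta`
    are hypothetical; see the full paper for the actual hindsight factor.
    """
    forward = td_loss(q_sa, reward, next_q_values, gamma, done)
    # Penalise drifting away from the earlier estimate, which damps
    # large upward jumps in the action value.
    backward = (q_sa_old - q_sa) ** 2
    return forward + delta * backward


if __name__ == "__main__":
    print(td_loss(1.0, 1.0, [0.0, 2.0], gamma=0.5))
    print(hindsight_loss(1.0, 0.0, 1.0, [0.0, 2.0], gamma=0.5, delta=0.5))
```

Under this assumed form, the extra term pulls the new estimate toward the earlier one, which is one way to read the abstract's description of the hindsight factor acting like an adaptive learning rate that depends on the previously estimated action value.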

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.09781/full.md

## Figures

16 figures with captions in the complete paper: https://tomesphere.com/paper/1906.09781/full.md

## References

16 references — full list in the complete paper: https://tomesphere.com/paper/1906.09781/full.md

---
Source: https://tomesphere.com/paper/1906.09781