Count-Based Temperature Scheduling for Maximum Entropy Reinforcement Learning

Dailin Hu; Pieter Abbeel; Roy Fox

arXiv:2111.14204·cs.LG·November 30, 2021

TL;DR

This paper introduces a state-dependent temperature scheduling method for MaxEnt RL, specifically instantiated as Count-Based Soft Q-Learning, which adapts the tradeoff coefficient during training to improve stability and performance.

Contribution

The paper proposes a novel count-based, state-dependent temperature schedule for MaxEnt RL, enhancing existing algorithms like SQL by dynamically adjusting the entropy tradeoff.

Findings

01. Improved training stability in toy and Atari domains

02. Enhanced policy robustness with adaptive temperature scheduling

03. Promising results demonstrating effectiveness of the approach

Abstract

Maximum Entropy Reinforcement Learning (MaxEnt RL) algorithms such as Soft Q-Learning (SQL) and Soft Actor-Critic trade off reward and policy entropy, which has the potential to improve training stability and robustness. Most MaxEnt RL methods, however, use a constant tradeoff coefficient (temperature), contrary to the intuition that the temperature should be high early in training to avoid overfitting to noisy value estimates and decrease later in training as we increasingly trust high value estimates to truly lead to good rewards. Moreover, our confidence in value estimates is state-dependent, increasing every time we use more evidence to update an estimate. In this paper, we present a simple state-based temperature scheduling approach, and instantiate it for SQL as Count-Based Soft Q-Learning (CBSQL). We evaluate our approach on a toy domain as well as in several Atari 2600 domains…
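
The scheduling idea described in the abstract can be illustrated with a small tabular sketch. This is not the paper's CBSQL algorithm: the schedule tau_init / (1 + count), the floor tau_min, the uniform-prior form of the soft value, the choice to use the visit count of the bootstrapped next state, and all names (soft_value, count_based_temperature, CountBasedSoftQ) are assumptions made for illustration only. The sketch shows one plausible way a per-state visit count could lower the temperature as evidence accumulates.

```python
import math
from collections import defaultdict


def soft_value(q_values, temperature):
    """Soft state value under a uniform prior: tau * log(mean(exp(Q / tau))).

    Approaches max(Q) as tau -> 0 and mean(Q) as tau -> infinity.
    """
    m = max(q_values)
    # Subtract the max before exponentiating for numerical stability.
    total = sum(math.exp((q - m) / temperature) for q in q_values)
    return m + temperature * math.log(total / len(q_values))


def count_based_temperature(count, tau_init=1.0, tau_min=0.01):
    """Hypothetical schedule: high temperature for rarely updated states,
    decaying with the visit count and clipped at tau_min."""
    return max(tau_min, tau_init / (1.0 + count))


class CountBasedSoftQ:
    """Tabular soft Q-learning with a per-state temperature (illustrative)."""

    def __init__(self, n_actions, lr=0.5, gamma=0.99):
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.counts = defaultdict(int)  # number of updates made at each state
        self.lr = lr
        self.gamma = gamma

    def update(self, s, a, r, s_next, done):
        self.counts[s] += 1
        # Assumption: the temperature used in the backup depends on how often
        # the next state's estimates have themselves been updated.
        tau = count_based_temperature(self.counts[s_next])
        target = r
        if not done:
            target += self.gamma * soft_value(self.q[s_next], tau)
        self.q[s][a] += self.lr * (target - self.q[s][a])
```

With this form of the soft value, a rarely updated state is backed up close to the average of its action values, while a frequently updated state is backed up close to its maximum, which matches the intuition of trusting high value estimates more as evidence accumulates.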

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Memory and Neural Computing

Methods: Q-Learning