Extreme Q-Learning: MaxEnt RL without Entropy

Divyansh Garg; Joey Hejna; Matthieu Geist; Stefano Ermon

arXiv:2301.02328 · cs.LG · March 2, 2023 · 5 cites

TL;DR

This paper introduces Extreme Q-Learning, which uses Extreme Value Theory (EVT) to model the maximal Q-value directly. This enables MaxEnt RL without sampling from a policy, and the method performs strongly on the D4RL, Franka Kitchen, and DM Control benchmarks.

Contribution

The paper derives online and, for the first time, offline MaxEnt Q-learning algorithms that do not explicitly require access to a policy or its entropy. Instead, they model the maximal Q-value directly with Extreme Value Theory (EVT).

Findings

01 Outperforms prior methods by 10+ points on Franka Kitchen tasks

02 Achieves moderate improvements over SAC and TD3 on DM Control tasks

03 Demonstrates strong performance in D4RL benchmarks

Abstract

Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from economics. By doing so, we avoid computing Q-values using out-of-distribution actions, which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value functions (LogSumExp) in the maximum entropy RL setting without needing to sample from a policy. Using EVT, we derive our Extreme Q-Learning framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms that do not explicitly require access to a policy or its entropy. Our…
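
As a rough illustration of the abstract's key insight (estimating the LogSumExp soft value directly, without a policy), the sketch below fits a scalar v to sampled Q-values by minimizing exp(z) - z - 1 with z = (q - v) / beta. The loss, the function names soft_value, gumbel_loss and fit_value, the temperature beta, and the synthetic Gaussian Q-values are all assumptions made for illustration and are not taken from the text above. Setting the derivative of this loss to zero gives v = beta * log mean exp(q / beta), which the code checks numerically.

```python
import numpy as np


def soft_value(q, beta):
    """Closed-form soft value: beta * log(mean(exp(q / beta))), computed stably."""
    z = q / beta
    m = z.max()
    return beta * (m + np.log(np.mean(np.exp(z - m))))


def gumbel_loss(v, q, beta):
    """Illustrative loss (an assumption, not quoted from the page):
    mean of exp(z) - z - 1 with z = (q - v) / beta. Convex in v."""
    z = (q - v) / beta
    return np.mean(np.exp(z) - z - 1.0)


def fit_value(q, beta, steps=50):
    """Minimize gumbel_loss over the scalar v with Newton's method.

    Starting at mean(q) keeps v at or below the minimizer (by Jensen's
    inequality), so the iterates increase toward it without overshooting.
    """
    v = float(np.mean(q))
    for _ in range(steps):
        w = np.mean(np.exp((q - v) / beta))
        # gradient = (1 - w) / beta, hessian = w / beta**2
        v += beta * (w - 1.0) / w
    return v


# Hypothetical data: Q-values for actions drawn from some sampling distribution.
rng = np.random.default_rng(0)
q_samples = rng.normal(size=1000)
beta = 0.5

v_fit = fit_value(q_samples, beta)
v_closed = soft_value(q_samples, beta)
print("fitted v:", v_fit)
print("closed-form soft value:", v_closed)
print("mean(q):", q_samples.mean(), "max(q):", q_samples.max())
```

The fitted value lies between the mean and the maximum of the sampled Q-values, and it moves toward the maximum as beta shrinks. Only Q-values of sampled actions enter the loss, so no policy and no entropy term is evaluated in this sketch.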

Code & Models

Repositories

Videos

Extreme Q-Learning: MaxEnt RL without Entropy · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning

Methods: Convolution · Global Average Pooling · Adam · Target Policy Smoothing · 1x1 Convolution · Average Pooling · Clipped Double Q-learning · Experience Replay · Dilated Convolution · Switchable Atrous Convolution