Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence

Lingwei Zhu; Zheng Chen; Matthew Schlegel; Martha White

arXiv:2301.11476·cs.LG·March 19, 2024

Open Access

TL;DR

This paper introduces a generalized policy regularization method in reinforcement learning using Tsallis KL divergence, extending traditional KL approaches, and demonstrates its effectiveness through improved performance on Atari games.

Contribution

It proposes a novel reinforcement learning algorithm that incorporates Tsallis KL divergence, generalizing existing methods and showing empirical benefits over standard approaches.

Findings

1. Tsallis KL divergence generalizes standard KL with a parameter q.

2. Generalized MVI(q) outperforms standard MVI in Atari games.

3. q > 1 can provide benefits in policy learning.

Abstract

Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence, called the Tsallis KL divergence, which uses the $q$-logarithm in its definition. The approach is a strict generalization, as $q = 1$ corresponds to the standard KL divergence; $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q > 1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches…
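To make the role of $q$ concrete, the following is a minimal sketch of a $q$-logarithm and a Tsallis KL divergence between two discrete policies. The truncated abstract does not give the exact definitions, so the convention used here, $\ln_q x = (x^{q-1} - 1)/(q - 1)$ with divergence $\sum_a \pi(a) \ln_q(\pi(a)/\mu(a))$, is an assumption, and the function names `q_log` and `tsallis_kl` are hypothetical. It illustrates the divergence only, not the paper's MVI(q) algorithm.

```python
import math


def q_log(x, q):
    """q-logarithm under the ASSUMED convention ln_q(x) = (x^(q-1) - 1) / (q - 1).

    Falls back to the natural logarithm when q is (numerically) 1,
    which is the limit of the expression above as q -> 1.
    """
    if x <= 0.0:
        raise ValueError("q_log is only defined here for x > 0")
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)


def tsallis_kl(pi, mu, q):
    """Tsallis KL divergence sum_a pi(a) * ln_q(pi(a) / mu(a)) (assumed form).

    pi and mu are sequences of action probabilities over the same actions.
    Actions with pi(a) == 0 contribute nothing; mu(a) must be positive
    wherever pi(a) is positive.
    """
    if len(pi) != len(mu):
        raise ValueError("pi and mu must have the same length")
    return sum(p * q_log(p / m, q) for p, m in zip(pi, mu) if p > 0.0)


# Hypothetical example policies over three actions.
new_policy = [0.5, 0.3, 0.2]
old_policy = [0.4, 0.4, 0.2]

standard_kl = tsallis_kl(new_policy, old_policy, q=1.0)   # ordinary KL
tsallis_q2 = tsallis_kl(new_policy, old_policy, q=2.0)    # one q > 1 choice
```

Under this convention, setting `q=1.0` recovers the standard KL divergence, consistent with the abstract's statement that the Tsallis KL is a strict generalization, while `q=2.0` reduces to $\sum_a \pi(a)^2/\mu(a) - 1$.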

Taxonomy

Topics: Decision-Making and Behavioral Economics

Methods: Trust Region Policy Optimization