Actor-critic is implicitly biased towards high entropy optimal policies

Yuzheng Hu; Ziwei Ji; Matus Telgarsky

arXiv:2110.11280 · cs.LG · March 15, 2022 · 1 citation

TL;DR

This paper demonstrates that a simple actor-critic algorithm inherently favors high entropy optimal policies, enabling convergence without explicit regularization, exploration, or mixing assumptions, even from a single trajectory.

Contribution

It reveals the implicit high entropy bias of actor-critic methods and introduces tools for analyzing mixing times and implicit regularization effects.

Findings

01

The algorithm prefers high entropy optimal policies even though it uses no explicit regularization.

02

Convergence to optimal policies occurs without mixing assumptions or exploration strategies.

03

The analysis decouples actor and critic concerns using mirror descent and provides bounds on mixing times.

Abstract

We show that the simplest actor-critic method -- a linear softmax policy updated with TD through interaction with a linear MDP, but featuring no explicit regularization or exploration -- does not merely find an optimal policy, but moreover prefers high entropy optimal policies. To demonstrate the strength of this bias, the algorithm not only has no regularization, no projections, and no exploration like $\epsilon$-greedy, but is moreover trained on a single trajectory with no resets. The key consequence of the high entropy bias is that uniform mixing assumptions on the MDP, which exist in some form in all prior work, can be dropped: the implicit regularization of the high entropy bias is enough to ensure that all chains mix and an optimal policy is reached with high probability. As auxiliary contributions, this work decouples concerns between the actor and critic by writing the actor…
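To make the setting described in the abstract concrete, here is a minimal sketch, not the authors' algorithm or analysis. It uses a softmax policy over one-hot (tabular) features, which is a special case of a linear softmax policy, and a TD(0) critic, both updated along a single trajectory with no resets, no regularization, no projections, and no $\epsilon$-greedy exploration. The baseline-subtracted score-function actor update, the step sizes, the toy MDP, and the names `actor_critic` and `policy_entropy` are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def policy_entropy(theta):
    """Mean Shannon entropy (in nats) of the softmax policy across states."""
    entropies = []
    for s in range(theta.shape[0]):
        p = softmax(theta[s])
        entropies.append(-np.sum(p * np.log(p + 1e-12)))
    return float(np.mean(entropies))


def actor_critic(P, R, gamma=0.9, steps=5000, actor_lr=0.05,
                 critic_lr=0.1, seed=0):
    """Single-trajectory actor-critic sketch (hypothetical helper).

    P: transition kernel, shape (n_states, n_actions, n_states).
    R: rewards, shape (n_states, n_actions).
    Returns policy logits `theta` and Q-estimates `w`, both (n_states, n_actions).
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    theta = np.zeros((n_states, n_actions))  # actor: softmax logits
    w = np.zeros((n_states, n_actions))      # critic: tabular Q estimate

    # One trajectory, started once and never reset.
    s = 0
    a = rng.choice(n_actions, p=softmax(theta[s]))
    for _ in range(steps):
        s_next = rng.choice(n_states, p=P[s, a])
        a_next = rng.choice(n_actions, p=softmax(theta[s_next]))

        # Critic: on-policy TD(0) update of Q(s, a).
        td_error = R[s, a] + gamma * w[s_next, a_next] - w[s, a]
        w[s, a] += critic_lr * td_error

        # Actor: score-function step with a value baseline (an assumption;
        # the source does not spell out the exact actor update here).
        pi_s = softmax(theta[s])
        advantage = w[s, a] - pi_s @ w[s]
        grad_log = -pi_s
        grad_log[a] += 1.0
        theta[s] += actor_lr * advantage * grad_log

        s, a = s_next, a_next
    return theta, w


if __name__ == "__main__":
    # Toy MDP (assumed): 2 states, 3 actions, uniform transitions;
    # actions 0 and 1 are equally rewarding, action 2 is worse.
    P = np.full((2, 3, 2), 0.5)
    R = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
    theta, w = actor_critic(P, R)
    print("policy per state:", [softmax(row).round(3) for row in theta])
    print("mean policy entropy:", policy_entropy(theta))
```

Tracking `policy_entropy` during training on a toy problem like this is one way to see how much randomness the learned policy retains among equally good actions. The sketch makes no claim to reproduce the paper's guarantees.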

Videos

Actor-critic is implicitly biased towards high entropy optimal policies · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Model Reduction and Neural Networks · Adversarial Robustness in Machine Learning

Methods: Softmax