An Alternative Softmax Operator for Reinforcement Learning

Kavosh Asadi; Michael L. Littman

arXiv:1612.05628 · cs.AI · June 15, 2017 · 26 cites


TL;DR

This paper studies a differentiable softmax operator for reinforcement learning that is a non-expansion, which ensures convergent behavior in learning and planning, and shows that the commonly used Boltzmann softmax operator is prone to misbehavior.

Contribution

The authors propose a differentiable softmax operator that is a non-expansion and use it in a variant of SARSA that computes a Boltzmann policy with a state-dependent temperature parameter. They show that this algorithm is convergent.

Findings

01. The new operator is a non-expansion, which ensures convergent behavior in learning and planning.

02. The SARSA variant built on the new operator is convergent and performs favorably in practice.

03. The traditional Boltzmann softmax operator is shown to be prone to misbehavior; the new operator, being a non-expansion, does not share this problem.

Abstract

A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one's weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study a differentiable softmax operator that, among other properties, is a non-expansion, ensuring convergent behavior in learning and planning. We introduce a variant of the SARSA algorithm that, by utilizing the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice.
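The abstract does not state either operator's formula, so the following is only a minimal sketch under stated assumptions. It assumes the Boltzmann softmax is the usual exponentially weighted average of the values, and that the new operator is the log-mean-exp form that the full paper calls mellowmax. The function names and the parameter names `beta` and `omega` are illustrative choices, not taken from this page.

```python
import numpy as np


def boltzmann_softmax(values, beta):
    """Assumed form: sum_i x_i * exp(beta * x_i) / sum_j exp(beta * x_j)."""
    x = np.asarray(values, dtype=float)
    # Subtracting the max leaves the weights' ratios unchanged and avoids overflow.
    w = np.exp(beta * (x - x.max()))
    return float(np.dot(w, x) / w.sum())


def log_mean_exp_softmax(values, omega):
    """Assumed form: log((1/n) * sum_i exp(omega * x_i)) / omega, with omega > 0."""
    x = np.asarray(values, dtype=float)
    m = x.max()
    # Factor exp(omega * m) out of the mean for numerical stability.
    return float(m + np.log(np.mean(np.exp(omega * (x - m)))) / omega)


def max_abs_diff(a, b):
    """Max-norm distance between two value vectors."""
    return float(np.max(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))))


def violates_non_expansion(op, a, b, param, tol=1e-9):
    """True if |op(a) - op(b)| exceeds the max-norm distance between a and b."""
    return abs(op(a, param) - op(b, param)) > max_abs_diff(a, b) + tol
```

With a small parameter both functions return values close to the plain average, and with a large parameter both approach the maximum, which is the "somewhat like an average, somewhat like the maximization function" behavior the abstract describes. `violates_non_expansion` can be used to probe either operator numerically on pairs of value vectors; a non-expansion never triggers it.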



Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Risk and Portfolio Optimization

Methods: Softmax