An Alternative Softmax Operator for Reinforcement Learning
Kavosh Asadi, Michael L. Littman

TL;DR
This paper studies a differentiable softmax operator for reinforcement learning that is a non-expansion, which ensures convergent behavior in learning and planning, and shows that the commonly used Boltzmann softmax operator is prone to misbehavior.
Contribution
The authors propose a softmax operator that is a non-expansion and use it in a SARSA variant that computes a Boltzmann policy with a state-dependent temperature parameter; the variant is shown to converge.
Findings
The new operator is a non-expansion, ensuring convergence in learning and planning.
The SARSA variant with the new operator converges and performs well in practice.
The Boltzmann softmax operator, the most common choice in this setting, is prone to misbehavior that the new operator avoids.
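The non-expansion finding can be checked numerically on a small example. The sketch below is not the authors' code: it assumes the operator in question is the log-average-exp form (called mellowmax in the paper) with a parameter `omega`, and it compares it with the Boltzmann softmax at inverse temperature `beta`. The function names and the test vectors `x` and `y` are illustrative choices. An operator is a non-expansion in the max norm if its output never moves by more than the largest change in any input.

```python
import math

def boltzmann_softmax(values, beta):
    """Boltzmann softmax: the exp(beta * x)-weighted average of the values."""
    m = max(values)  # shift by the max so exp() cannot overflow
    weights = [math.exp(beta * (v - m)) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def mellowmax(values, omega):
    """Log-average-exp operator: log(mean(exp(omega * x))) / omega (assumed form)."""
    m = max(values)  # same stabilizing shift, added back after the log
    mean_exp = sum(math.exp(omega * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / omega

def max_norm_distance(x, y):
    """Largest absolute difference between corresponding entries."""
    return max(abs(a - b) for a, b in zip(x, y))

# Two illustrative value vectors that differ by about 0.1 in the max norm.
x = [0.0, 1.0]
y = [-0.1, 1.1]
gap = max_norm_distance(x, y)

boltz_change = abs(boltzmann_softmax(x, 5.0) - boltzmann_softmax(y, 5.0))
mellow_change = abs(mellowmax(x, 5.0) - mellowmax(y, 5.0))

# A non-expansion must satisfy change <= gap for every pair of inputs.
print("Boltzmann expands on this pair:", boltz_change > gap)
print("Mellowmax stays within the gap:", mellow_change <= gap)
```

On this pair the Boltzmann output moves by about 0.104 while the inputs move by only about 0.1, so it is not a non-expansion here; the log-average-exp output moves by about 0.099. One pair cannot prove the non-expansion property in general, but a single violating pair is enough to rule it out for the Boltzmann operator at this `beta`.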
Abstract
A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one's weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study a differentiable softmax operator that, among other properties, is a non-expansion ensuring convergent behavior in learning and planning. We introduce a variant of the SARSA algorithm that, by utilizing the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice.
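The abstract does not spell out how the state-dependent temperature is obtained. A minimal sketch follows, assuming the construction described in the paper: for the action values of one state, pick the inverse temperature `beta` that solves sum_a exp(beta * d_a) * d_a = 0, where d_a is the action value minus the log-average-exp (mellowmax) value of that state. The resulting Boltzmann policy then has an expected action value equal to the operator's output. Bisection is used here only as a simple stand-in root finder, and all function names are hypothetical.

```python
import math

def mellowmax(values, omega):
    """Log-average-exp operator: log(mean(exp(omega * x))) / omega (assumed form)."""
    m = max(values)
    mean_exp = sum(math.exp(omega * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / omega

def state_dependent_beta(q_values, omega, iters=200):
    """Solve sum_a exp(beta * d_a) * d_a = 0 for beta >= 0, with d_a = Q(s,a) - mellowmax."""
    target = mellowmax(q_values, omega)
    d = [q - target for q in q_values]
    d_max = max(d)
    if d_max <= 0.0:
        # All action values are equal, so every beta gives the uniform policy.
        return 0.0

    def g(beta):
        # The root equation rescaled by exp(-beta * d_max): same sign, no overflow.
        return sum(math.exp(beta * (di - d_max)) * di for di in d)

    lo, hi = 0.0, 1.0
    while g(hi) < 0.0:
        # The unscaled sum is nondecreasing in beta, so widen until the sign flips.
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def boltzmann_policy(q_values, beta):
    """Action probabilities proportional to exp(beta * Q(s,a))."""
    m = max(q_values)
    weights = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative action values for a single state.
q = [0.1, 0.5, 0.9]
beta = state_dependent_beta(q, omega=5.0)
pi = boltzmann_policy(q, beta)
expected_q = sum(p * v for p, v in zip(pi, q))
print("beta:", beta)
print("policy:", pi)
print("expected value matches mellowmax:", abs(expected_q - mellowmax(q, 5.0)) < 1e-9)
```

Because `beta` is recomputed from each state's own action values, the temperature varies from state to state even though `omega` is fixed, which is what "state-dependent" refers to in this sketch.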
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Risk and Portfolio Optimization
Methods: Softmax
