RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning
Yukinari Hisaki, Isao Ono

TL;DR
This paper introduces RVI-SAC, an off-policy deep reinforcement learning algorithm that optimizes the average reward criterion. It addresses the mismatch between the discounted training objective and the performance metric in continuing tasks, and it performs competitively on MuJoCo locomotion benchmarks.
Contribution
RVI-SAC extends Soft Actor-Critic to the average reward setting with novel critic and actor updates, enabling effective learning in continuing tasks.
Findings
RVI-SAC performs competitively on MuJoCo locomotion tasks.
The method effectively incorporates the average reward criterion into off-policy DRL.
Automatic adjustment of the Reset Cost extends the method to tasks with termination.
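As a rough illustration of why a Reset Cost lets the average reward criterion cover tasks with termination, the sketch below treats a terminating task as one unbroken stream of steps and charges a fixed penalty at each reset. The function name `average_reward_with_resets`, its arguments, and the fixed penalty are assumptions made for illustration. The summary does not spell out how RVI-SAC adjusts the cost automatically, so that part is not shown.

```python
def average_reward_with_resets(rewards, terminated, reset_cost):
    """Average per-step reward of one long stream of steps.

    Assumption (hypothetical, for illustration only): every step flagged
    as terminated is followed by a reset, and each reset is charged a
    fixed penalty of `reset_cost`.
    """
    if len(rewards) != len(terminated):
        raise ValueError("rewards and terminated must have the same length")
    if not rewards:
        raise ValueError("need at least one step")
    n_resets = sum(1 for flag in terminated if flag)
    return (sum(rewards) - reset_cost * n_resets) / len(rewards)


# Four steps with reward 1 each and one termination on the second step.
example_rewards = [1.0, 1.0, 1.0, 1.0]
example_terminated = [False, True, False, False]
penalized = average_reward_with_resets(example_rewards, example_terminated, reset_cost=2.0)
unpenalized = average_reward_with_resets(example_rewards, example_terminated, reset_cost=0.0)
```

With a zero cost, terminating early is free under this objective. A positive cost lowers the average reward of a policy that resets often, which is presumably why the size of the cost matters enough to be tuned.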
Abstract
In this paper, we propose an off-policy deep reinforcement learning (DRL) method utilizing the average reward criterion. While most existing DRL methods employ the discounted reward criterion, this can potentially lead to a discrepancy between the training objective and performance metrics in continuing tasks, making the average reward criterion a recommended alternative. We introduce RVI-SAC, an extension of the state-of-the-art off-policy DRL method, Soft Actor-Critic (SAC), to the average reward criterion. Our proposal consists of (1) Critic updates based on RVI Q-learning, (2) Actor updates introduced by the average reward soft policy improvement theorem, and (3) automatic adjustment of Reset Cost enabling the average reward reinforcement learning to be applied to tasks with termination. We apply our method to Gymnasium's MuJoCo tasks, a subset of locomotion tasks, and…
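To make the critic idea more concrete, here is a minimal tabular sketch of the classic RVI Q-learning update, in which the discount factor is replaced by subtracting a reference value f(Q) that doubles as the average reward estimate. This is not the RVI-SAC critic itself, which uses neural networks and an entropy term. The toy MDP, the choice f(Q) = Q[0, 0], the step size, and the synchronous sweep over a fixed set of transitions are all assumptions chosen to keep the example deterministic.

```python
import numpy as np

# Hypothetical 2-state, 2-action deterministic MDP (not from the paper).
# Each entry is (state, action, reward, next_state).
TRANSITIONS = [
    (0, 0, 1.0, 0),  # stay in state 0, reward 1
    (0, 1, 0.0, 1),  # move to state 1, reward 0
    (1, 0, 2.0, 1),  # stay in state 1, reward 2
    (1, 1, 0.0, 0),  # move to state 0, reward 0
]


def rvi_q_learning(transitions, n_states=2, n_actions=2,
                   sweeps=2000, alpha=0.1, ref=(0, 0)):
    """Tabular RVI Q-learning over a fixed batch of stored transitions.

    The update has no discount factor. Instead it subtracts a reference
    value f(Q) = Q[ref], which serves as the average reward estimate.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        Q_prev = Q.copy()   # bootstrap from a frozen snapshot within each sweep
        rho = Q_prev[ref]   # f(Q): current average reward estimate
        for s, a, r, s_next in transitions:
            td_error = r - rho + Q_prev[s_next].max() - Q_prev[s, a]
            Q[s, a] = Q_prev[s, a] + alpha * td_error
    return Q, float(Q[ref])


Q, rho = rvi_q_learning(TRANSITIONS)
print("average reward estimate:", rho)
```

In this toy MDP the best long-run behavior is to move to state 1 and stay there, earning 2 per step, so the estimate should settle near 2. The updates are off-policy in the sense that they use stored transitions without regard to the policy that generated them.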
Taxonomy
Topics: Reinforcement Learning in Robotics
