Reinforcement Learning With Reward Machines in Stochastic Games
Jueming Hu, Jean-Raphael Gaglione, Yanze Wang, Zhe Xu, Ufuk Topcu, and Yongming Liu

TL;DR
This paper introduces Q-learning with reward machines for stochastic games, enabling multi-agent systems to learn Nash equilibrium strategies in complex, non-Markovian reward environments with proven convergence properties.
Contribution
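The paper develops QRM-SG, an algorithm that incorporates reward machines into multi-agent reinforcement learning for stochastic games. The learned Q-functions are proven to converge to the Q-functions at a Nash equilibrium, provided the stage games encountered during learning satisfy the conditions stated in the abstract.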
Findings
QRM-SG effectively learns best-response strategies in complex stochastic games.
QRM-SG converges faster than baseline methods like Nash Q-learning and MADDPG.
The algorithm demonstrates successful convergence in three case studies.
Abstract
We investigate multi-agent reinforcement learning for stochastic games with complex tasks, where the reward functions are non-Markovian. We utilize reward machines to incorporate high-level knowledge of complex tasks. We develop an algorithm called Q-learning with reward machines for stochastic games (QRM-SG), to learn the best-response strategy at Nash equilibrium for each agent. In QRM-SG, we define the Q-function at a Nash equilibrium in augmented state space. The augmented state space integrates the state of the stochastic game and the state of reward machines. Each agent learns the Q-functions of all agents in the system. We prove that Q-functions learned in QRM-SG converge to the Q-functions at a Nash equilibrium if the stage game at each time step during learning has a global optimum point or a saddle point, and the agents update Q-functions based on the best-response strategy at…
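To make the augmented state space concrete, here is a minimal sketch of a reward machine and a tabular Q-update keyed on an augmented state. It is an illustration under stated assumptions, not the paper's implementation: the class `RewardMachine`, the helpers `run_labels` and `q_update`, the labels "a" and "b", the example task, and the choice to pass the next-state equilibrium value in as a plain number are all hypothetical.

```python
from collections import defaultdict


class RewardMachine:
    """Finite-state machine mapping (rm_state, label) -> (next_rm_state, reward)."""

    def __init__(self, transitions, initial_state=0):
        self.transitions = transitions
        self.initial_state = initial_state

    def step(self, u, label):
        # Assumption: unlisted (state, label) pairs self-loop with zero reward.
        return self.transitions.get((u, label), (u, 0.0))


# Hypothetical task: observe "a" first, then "b"; reward 1 only on completion.
rm = RewardMachine({(0, "a"): (1, 0.0), (1, "b"): (2, 1.0)})


def run_labels(machine, labels):
    """Feed a label sequence through the machine; return final state and rewards."""
    u = machine.initial_state
    rewards = []
    for label in labels:
        u, r = machine.step(u, label)
        rewards.append(r)
    return u, rewards


final_u, rewards = run_labels(rm, ["b", "a", "b"])
print(final_u, rewards)  # 2 [0.0, 0.0, 1.0]

# Q-table over (augmented state, joint action). The augmented state pairs the
# game state with the reward-machine state, e.g. ((row, col), u).
Q = defaultdict(float)


def q_update(Q, aug_state, joint_action, reward, next_value, alpha=0.5, gamma=0.9):
    """One tabular update; next_value stands in for the stage-game equilibrium value."""
    key = (aug_state, joint_action)
    Q[key] = (1 - alpha) * Q[key] + alpha * (reward + gamma * next_value)
    return Q[key]


updated = q_update(Q, ((0, 0), 0), ("up", "left"), reward=1.0, next_value=0.0)
```

The same label "b" yields reward 0 the first time and reward 1 the second time, so the reward is non-Markovian in the game state alone but becomes Markovian once the reward-machine state is included. In a multi-agent setting each agent would keep one such table per agent, and `next_value` would come from solving the stage game at the next augmented state, which this sketch leaves out.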
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Batch Normalization · Adam · Convolution · Dense Connections · Weight Decay · Q-Learning · Experience Replay · MADDPG
