# Shapley Q-value: A Local Reward Approach to Solve Global Reward Games

**Authors:** Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, Yunjie Gu

arXiv: 1907.05707 · 2022-10-14

## TL;DR

This paper introduces Shapley Q-value, a local reward method based on cooperative game theory, to improve credit assignment and learning efficiency in multi-agent reinforcement learning with global rewards.

## Contribution

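The paper proposes the extended convex game (ECG), a cooperative game-theoretic framework that is a superset of the global reward game, and Shapley Q-value, a local reward approach that distributes the global reward according to each agent's contribution instead of handing every agent the same shared reward. It also derives SQDDPG, a MARL algorithm that uses Shapley Q-value as the critic for each agent.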

## Key findings

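- SQDDPG shows a significant improvement in convergence rate over MADDPG, COMA, Independent DDPG and Independent A2C.
- The comparison covers three environments: Cooperative Navigation, Prey-and-Predator and Traffic Junction.
- Plots of the learned Shapley Q-value support its property of fair credit assignment among agents.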

## Abstract

Cooperative games are a critical research area in multi-agent reinforcement learning (MARL). The global reward game is a subclass of cooperative games in which all agents aim to maximize the global reward. Credit assignment is an important problem in the global reward game. Most previous work adopts a non-cooperative game-theoretic framework with the shared reward approach, i.e., each agent is assigned the shared global reward directly. This, however, may give each agent an inaccurate reward for its contribution to the group, which can cause inefficient learning. To address this problem, we i) introduce a cooperative game-theoretic framework called the extended convex game (ECG), which is a superset of the global reward game, and ii) propose a local reward approach called Shapley Q-value. In contrast to the shared reward approach, Shapley Q-value distributes the global reward so that it reflects each agent's own contribution. Moreover, we derive a MARL algorithm called Shapley Q-value deep deterministic policy gradient (SQDDPG), which uses Shapley Q-value as the critic for each agent. We evaluate SQDDPG on Cooperative Navigation, Prey-and-Predator and Traffic Junction against state-of-the-art algorithms, e.g., MADDPG, COMA, Independent DDPG and Independent A2C. In the experiments, SQDDPG shows a significant improvement in convergence rate. Finally, we plot Shapley Q-value and validate its property of fair credit assignment.
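
To make the contrast between a shared reward and a contribution-based local reward concrete, here is a minimal sketch of the classical Shapley value from cooperative game theory, the concept that Shapley Q-value is named after. It is not the paper's SQDDPG algorithm: the coalition value function `toy_value`, the agent names, and the `skill` numbers are hypothetical and chosen only for illustration, and the exact enumeration over all orderings is practical only for a handful of agents.

```python
from itertools import permutations
from math import factorial


def shapley_values(agents, value):
    """Exact Shapley values: average each agent's marginal contribution
    over every possible order in which agents could join the coalition."""
    totals = {agent: 0.0 for agent in agents}
    for order in permutations(agents):
        coalition = frozenset()
        for agent in order:
            with_agent = coalition | {agent}
            # Marginal contribution of `agent` to the coalition built so far.
            totals[agent] += value(with_agent) - value(coalition)
            coalition = with_agent
    n_orders = factorial(len(agents))
    return {agent: total / n_orders for agent, total in totals.items()}


def toy_value(coalition):
    """Hypothetical coalition value (an assumption, not from the paper):
    individual skills add up, plus a small bonus for each extra teammate."""
    skill = {"a": 1.0, "b": 2.0, "c": 3.0}
    base = sum(skill[agent] for agent in coalition)
    bonus = 0.5 * max(len(coalition) - 1, 0)
    return base + bonus


agents = ["a", "b", "c"]
global_reward = toy_value(frozenset(agents))
local_rewards = shapley_values(agents, toy_value)

# Shared reward approach: every agent sees the same global number.
print("shared reward per agent:", global_reward)
# Local reward approach: the global reward is split by contribution.
print("Shapley local rewards:", local_rewards)
```

In this toy setting the local rewards sum to the global reward, and agents with a larger marginal contribution receive a larger share, which is the kind of credit assignment the abstract describes.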

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.05707/full.md

## Figures

20 figures with captions in the complete paper: https://tomesphere.com/paper/1907.05707/full.md

## References

41 references — full list in the complete paper: https://tomesphere.com/paper/1907.05707/full.md

---
Source: https://tomesphere.com/paper/1907.05707