# Teaching on a Budget in Multi-Agent Deep Reinforcement Learning

**Authors:** Ercüment İlhan, Jeremy Gow, Diego Perez-Liebana

arXiv: 1905.01357 · 2019-05-30

## TL;DR

This paper explores peer-to-peer action advising in cooperative multi-agent deep reinforcement learning as a way to improve sample efficiency. It combines advising heuristics with a Random Network Distillation (RND) based knowledge measure and reports promising initial results in a gridworld environment.

## Contribution

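The paper proposes heuristics-based action advising techniques for cooperative decentralised multi-agent RL, where each agent's task-level policy uses nonlinear function approximation. A Random Network Distillation (RND) based measure lets an agent assess its knowledge of any given state, so teacher and student roles can arise without being assumed in advance.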

## Key findings

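- Peer-to-peer action advising may improve sample efficiency in cooperative decentralised multi-agent RL.
- An RND-based knowledge measure lets agents initiate teacher-student interactions with no prior role assumptions.
- Experiments in a gridworld environment suggest the approach is useful and warrants further investigation.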

## Abstract

Deep Reinforcement Learning (RL) algorithms can solve complex sequential decision tasks successfully. However, they have the major drawback of poor sample efficiency, which can often be tackled by knowledge reuse. In Multi-Agent Reinforcement Learning (MARL) this drawback becomes worse, but at the same time, a new set of opportunities to leverage knowledge is also presented through agent interactions. One promising approach among these is peer-to-peer action advising through a teacher-student framework. Despite being introduced originally for single-agent RL, recent studies show that it can also be applied to multi-agent scenarios with promising empirical results. However, studies in this line of research are currently very limited. In this paper, we propose heuristics-based action advising techniques in cooperative decentralised MARL, using a nonlinear function approximation based task-level policy. By adopting the Random Network Distillation technique, we devise a measurement for agents to assess their knowledge in any given state and be able to initiate the teacher-student dynamics with no prior role assumptions. Experimental results in a gridworld environment show that such an approach may indeed be useful and needs to be further investigated.
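
The idea of using Random Network Distillation as a knowledge measure can be illustrated with a small sketch. This is not the authors' implementation: the linear networks, the class `RNDUncertainty`, the function `maybe_advise`, the two thresholds, and the budget handling are all hypothetical choices made for illustration. The sketch only assumes the general RND recipe, in which a predictor is trained to match a fixed random target network on visited states, so a low prediction error suggests the state is familiar.

```python
import random


class RNDUncertainty:
    """Toy RND estimator using linear networks (illustrative assumption)."""

    def __init__(self, state_dim, embed_dim=4, lr=0.05, seed=0):
        rng = random.Random(seed)
        # Fixed, randomly initialised target network; it is never trained.
        self.target = [[rng.uniform(-1, 1) for _ in range(state_dim)]
                       for _ in range(embed_dim)]
        # Predictor network, trained to imitate the target on visited states.
        self.predictor = [[0.0] * state_dim for _ in range(embed_dim)]
        self.lr = lr

    @staticmethod
    def _forward(weights, state):
        return [sum(w * s for w, s in zip(row, state)) for row in weights]

    def uncertainty(self, state):
        """Mean squared error between predictor and target outputs."""
        t = self._forward(self.target, state)
        p = self._forward(self.predictor, state)
        return sum((a - b) ** 2 for a, b in zip(t, p)) / len(t)

    def update(self, state):
        """One gradient step on the squared prediction error for this state."""
        t = self._forward(self.target, state)
        p = self._forward(self.predictor, state)
        for i, row in enumerate(self.predictor):
            err = p[i] - t[i]
            for j, s in enumerate(state):
                row[j] -= self.lr * err * s


def maybe_advise(student, teacher, state, budget,
                 ask_threshold, give_threshold):
    """Hypothetical heuristic: advise only if budget remains, the student
    is uncertain about the state, and the teacher is confident about it."""
    if budget <= 0:
        return False
    if student.uncertainty(state) < ask_threshold:
        return False  # student already judges the state as familiar
    return teacher.uncertainty(state) < give_threshold


# Usage: one agent has visited a state many times, the other has not.
experienced = RNDUncertainty(state_dim=3, seed=0)
novice = RNDUncertainty(state_dim=3, seed=1)
visited_state = [1.0, 0.0, 0.0]
for _ in range(200):
    experienced.update(visited_state)

advice_given = maybe_advise(novice, experienced, visited_state,
                            budget=5, ask_threshold=0.05, give_threshold=0.01)
```

Because neither agent is labelled in advance, the same check can be run in both directions; whichever agent has the lower prediction error for a state ends up in the teacher role for that state.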

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.01357/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1905.01357/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/1905.01357/full.md

---
Source: https://tomesphere.com/paper/1905.01357