# Multi-Preference Actor Critic

**Authors:** Ishan Durugkar, Matthew Hausknecht, Adith Swaminathan, Patrick MacAlpine

arXiv: 1904.03295 · 2019-04-09

## TL;DR

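The paper proposes Multi-Preference Actor Critic (M-PAC), a policy gradient method that encodes multiple types of human feedback as constraints on the policy and enforces them through a Lagrangian relaxation. Experiments indicate that the constraints are respected and can speed up learning.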

## Contribution

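It introduces a general method for combining multiple feedback channels into a single policy gradient loss: each channel is expressed as a constraint on the policy, and the constraints are handled through a Lagrangian relaxation optimized by gradient descent.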

## Key findings

- Experiments verify that the feedback constraints are respected by the learned policy.
- Incorporating feedback as constraints can accelerate learning.
- Both results are reported on Atari and Pendulum tasks.

## Abstract

Policy gradient algorithms typically combine discounted future rewards with an estimated value function, to compute the direction and magnitude of parameter updates. However, for most Reinforcement Learning tasks, humans can provide additional insight to constrain the policy learning. We introduce a general method to incorporate multiple different feedback channels into a single policy gradient loss. In our formulation, the Multi-Preference Actor Critic (M-PAC), these different types of feedback are implemented as constraints on the policy. We use a Lagrangian relaxation to satisfy these constraints using gradient descent while learning a policy that maximizes rewards. Experiments in Atari and Pendulum verify that constraints are being respected and can accelerate the learning process.
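The Lagrangian relaxation mentioned in the abstract can be illustrated with a toy problem. The sketch below is not the paper's algorithm or loss: it is a minimal two-action example in which a hypothetical preference constraint (the policy must choose action 0 with probability at least `min_p0`) competes with the reward. The function `train`, its parameters, and the entropy bonus `tau` (added here only to keep the toy saddle-point dynamics stable) are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(rewards=(0.0, 1.0), min_p0=0.3, tau=0.5,
          lr_theta=0.5, lr_lam=0.5, steps=5000):
    """Toy Lagrangian relaxation for a constrained two-action policy.

    Objective (assumed for illustration):
        J(theta) = E[r] + tau * entropy(pi)
    Constraint (hypothetical preference):
        p0 - min_p0 >= 0, where p0 = pi(action 0) = sigmoid(theta)
    Lagrangian:
        L(theta, lam) = J(theta) + lam * (p0 - min_p0),  lam >= 0
    """
    theta = 0.0  # policy parameter (logit of action 0)
    lam = 0.0    # Lagrange multiplier for the preference constraint
    for _ in range(steps):
        p0 = sigmoid(theta)
        # Exact derivative of J with respect to p0 (no sampling in this toy).
        dJ_dp0 = rewards[0] - rewards[1] + tau * math.log((1.0 - p0) / p0)
        # Derivative of the Lagrangian with respect to p0.
        dL_dp0 = dJ_dp0 + lam
        # Chain rule through the sigmoid policy.
        dp0_dtheta = p0 * (1.0 - p0)
        # Gradient ascent on the policy parameter.
        theta += lr_theta * dL_dp0 * dp0_dtheta
        # Gradient descent on the multiplier, projected onto lam >= 0.
        lam = max(0.0, lam - lr_lam * (p0 - min_p0))
    return sigmoid(theta), lam


if __name__ == "__main__":
    p0, lam = train()
    print(f"pi(action 0) = {p0:.3f}, multiplier = {lam:.3f}")
```

Without the constraint, this toy policy would push the probability of action 0 below `min_p0` because action 1 pays more. The multiplier grows while the constraint is violated and stops growing once it is met, so the policy settles at the constraint boundary rather than at the unconstrained optimum.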

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.03295/full.md

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/1904.03295/full.md

## References

22 references — full list in the complete paper: https://tomesphere.com/paper/1904.03295/full.md

---
Source: https://tomesphere.com/paper/1904.03295