Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration

Ritesh Noothigattu; Djallel Bouneffouf; Nicholas Mattei; Rachita Chandra; Piyush Madan; Kush Varshney; Murray Campbell; Moninder Singh; Francesca Rossi

arXiv:1809.08343·cs.LG·September 25, 2018·22 cites

Open Access

TL;DR

This paper introduces a method that uses inverse reinforcement learning to learn societal constraints from demonstrations, uses reinforcement learning to maximize environment rewards, and combines the two resulting policies through a contextual bandit orchestrator, so that an autonomous agent can follow those constraints while pursuing reward in a transparent and adaptable way.

Contribution

It proposes a new approach that learns societal constraints from demonstrations and dynamically mixes constrained and reward-driven policies using a contextual bandit orchestrator.
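The orchestration idea can be illustrated with a small sketch. It is not the authors' implementation: it swaps in a plain epsilon-greedy contextual bandit with one linear reward model per arm, and every name in it (`EpsilonGreedyOrchestrator`, `toy_reward`, `run_demo`, the two-feature context with a `danger` flag) is hypothetical. Arm 0 stands in for a constraint-following policy learned from demonstrations and arm 1 for a reward-maximizing policy; the toy reward simply says which of the two is appropriate in each context.

```python
import random
from typing import List, Sequence

CONSTRAINED, REWARD = 0, 1  # hypothetical arm indices for the two policies


class EpsilonGreedyOrchestrator:
    """Toy orchestrator: one linear reward estimate per arm (policy)."""

    def __init__(self, n_arms: int, n_features: int,
                 epsilon: float = 0.1, lr: float = 0.05, seed: int = 0) -> None:
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)
        self.weights: List[List[float]] = [[0.0] * n_features for _ in range(n_arms)]

    def predict(self, arm: int, context: Sequence[float]) -> float:
        # Estimated reward of handing control to `arm` in this context.
        return sum(w * x for w, x in zip(self.weights[arm], context))

    def best_arm(self, context: Sequence[float]) -> int:
        # Purely greedy choice; ties go to the lowest arm index.
        scores = [self.predict(a, context) for a in range(len(self.weights))]
        return max(range(len(scores)), key=scores.__getitem__)

    def select(self, context: Sequence[float]) -> int:
        # Explore with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.weights))
        return self.best_arm(context)

    def update(self, arm: int, context: Sequence[float], reward: float) -> None:
        # One stochastic-gradient step on the squared prediction error.
        error = reward - self.predict(arm, context)
        self.weights[arm] = [w + self.lr * error * x
                             for w, x in zip(self.weights[arm], context)]


def toy_reward(arm: int, danger: float) -> float:
    # Assumed toy signal: the constrained policy is right when danger is
    # present, the reward-driven policy is right otherwise.
    if arm == CONSTRAINED:
        return 1.0 if danger else 0.0
    return 0.0 if danger else 1.0


def run_demo(steps: int = 5000, seed: int = 0) -> EpsilonGreedyOrchestrator:
    rng = random.Random(seed)
    orch = EpsilonGreedyOrchestrator(n_arms=2, n_features=2, seed=seed)
    for _ in range(steps):
        danger = float(rng.random() < 0.5)
        context = [1.0, danger]  # bias term plus the danger flag
        arm = orch.select(context)
        orch.update(arm, context, toy_reward(arm, danger))
    return orch


if __name__ == "__main__":
    orch = run_demo()
    for danger in (0.0, 1.0):
        arm = orch.best_arm([1.0, danger])
        print(f"danger={danger}: policy in control = "
              f"{'constrained' if arm == CONSTRAINED else 'reward'}")
```

In a sketch like this, the arm chosen at each step is a directly readable record of which policy was in control, which is one plausible reading of the transparency the summary refers to.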

Findings

01

The agent successfully learns to act optimally within societal constraints.

02

The method demonstrates the ability to blend policies for complex decision-making.

03

The approach is validated in a Pac-Man domain, showing effective constraint adherence and reward maximization.

Abstract

Autonomous cyber-physical agents and systems play an increasingly large role in our lives. To ensure that agents behave in ways aligned with the values of the societies in which they operate, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. These constraints and norms can come from any number of sources including regulations, business process guidelines, laws, ethical principles, social norms, and moral values. We detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations of the task, and reinforcement learning to learn to maximize the environment rewards. More precisely, we assume that an agent can observe traces of behavior of members of the society but has no access to the explicit set of…

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Data Stream Mining Techniques