Training Value-Aligned Reinforcement Learning Agents Using a Normative Prior
Md Sultan Al Nahian, Spencer Frazier, Brent Harrison, Mark Riedl

TL;DR
This paper proposes a reinforcement learning method that incorporates a normative prior to steer agents toward socially acceptable behavior while they perform tasks, tested across three text-based environments.
Contribution
It introduces a dual-reward reinforcement learning framework combining task performance with normative behavior, leveraging a normative prior model for improved societal alignment.
Findings
Agents trained with normative rewards exhibit more socially acceptable behaviors.
The approach balances task effectiveness with normative compliance.
Testing across three environments demonstrates improved normative behavior.
Abstract
As more machine learning agents interact with humans, it is increasingly a prospect that an agent trained to perform a task optimally, using only a measure of task performance as feedback, can violate societal norms for acceptable behavior or cause harm. Value alignment is a property of intelligent agents wherein they solely pursue non-harmful behaviors or human-beneficial goals. We introduce an approach to value-aligned reinforcement learning, in which we train an agent with two reward signals: a standard task performance reward, plus a normative behavior reward. The normative behavior reward is derived from a value-aligned prior model previously shown to classify text as normative or non-normative. We show how variations on a policy shaping technique can balance these two sources of reward and produce policies that are both effective and perceived as being more normative. We test our…
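The abstract describes balancing a task reward and a normative reward through variations on policy shaping. As a rough illustration only, the sketch below shows one common form of policy shaping: multiply the task policy's action probabilities by a per-action normative score and renormalize. The function names, the softmax task policy, and the example numbers are all assumptions made for this illustration and are not taken from the paper, which may combine the two signals differently.

```python
import math


def softmax(values, temperature=1.0):
    """Turn raw action values into a probability distribution."""
    top = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def shape_policy(task_q_values, normative_probs, temperature=1.0):
    """Combine a task policy with normative scores (hypothetical helper).

    task_q_values: task-reward value estimates, one per action.
    normative_probs: assumed output of a normative prior, i.e. the
        probability that each action's description is normative.
    """
    task_policy = softmax(task_q_values, temperature)
    # Policy shaping step: elementwise product, then renormalize.
    combined = [p * n for p, n in zip(task_policy, normative_probs)]
    total = sum(combined)
    if total == 0.0:
        # Every action was scored as fully non-normative; fall back
        # to the task policy rather than divide by zero.
        return task_policy
    return [c / total for c in combined]


# Illustrative numbers only: action 0 is best for the task but is
# scored as very unlikely to be normative.
task_q = [2.0, 1.0, 0.5]
norm_p = [0.05, 0.9, 0.8]
policy = shape_policy(task_q, norm_p)
best_action = policy.index(max(policy))
```

In this toy case the shaped policy shifts most of its probability away from action 0, even though that action has the highest task value, because the normative score heavily discounts it.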
Taxonomy
Topics: Ethics and Social Impacts of AI
