Concept Alignment as a Prerequisite for Value Alignment
Sunayana Rane, Mark Ho, Ilia Sucholutsky, Thomas L. Griffiths

TL;DR
This paper argues that AI systems must align their concepts with those of humans before they can align their values, and supports the claim with a formal analysis, a joint reasoning approach, and experiments with human participants.
Contribution
It introduces a formal analysis of concept alignment in inverse reinforcement learning and proposes a joint reasoning method to enhance value alignment in AI.
Findings
Neglecting concept alignment causes systematic value misalignment.
Joint reasoning about concepts and values reduces failure modes.
Humans reason about the concepts an agent is using when it acts intentionally.
Abstract
Value alignment is essential for building AI systems that can safely and reliably interact with people. However, what a person values -- and is even capable of valuing -- depends on the concepts that they are currently using to understand and evaluate what happens in the world. The dependence of values on concepts means that concept alignment is a prerequisite for value alignment -- agents need to align their representation of a situation with that of humans in order to successfully align their values. Here, we formally analyze the concept alignment problem in the inverse reinforcement learning setting, show how neglecting concept alignment can lead to systematic value mis-alignment, and describe an approach that helps minimize such failure modes by jointly reasoning about a person's concepts and values. Additionally, we report experimental results with human participants showing that…
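The idea of jointly reasoning about concepts and values can be illustrated with a toy sketch. This is not the authors' model or code: the two-route world, the names `ROUTES`, `LAVA_PENALTIES`, `CONSTRUALS`, the Boltzmann-rational choice rule, and the uniform priors are all assumptions made for illustration. A demonstrator picks the short route A, which crosses lava. A naive observer assumes the demonstrator sees the lava and infers values only; a joint observer also considers that the demonstrator's construal may omit the lava.

```python
import math
from itertools import product

# Hypothetical toy world: route A is short but crosses lava, route B is longer and safe.
ROUTES = {
    "A": {"length": 3, "has_lava": True},
    "B": {"length": 5, "has_lava": False},
}
# Candidate values: how much the demonstrator dislikes lava (0 = indifferent).
LAVA_PENALTIES = [0.0, -10.0]
# Candidate construals: whether the demonstrator's representation includes the lava.
CONSTRUALS = ["sees_lava", "misses_lava"]


def utility(route, lava_penalty, construal):
    """Utility of a route as the demonstrator perceives it under a construal."""
    info = ROUTES[route]
    perceived_lava = info["has_lava"] and construal == "sees_lava"
    return -info["length"] + (lava_penalty if perceived_lava else 0.0)


def choice_likelihood(route, lava_penalty, construal, beta=1.0):
    """Boltzmann-rational probability of choosing `route` (an assumed choice model)."""
    weights = {
        r: math.exp(beta * utility(r, lava_penalty, construal)) for r in ROUTES
    }
    return weights[route] / sum(weights.values())


def normalize(scores):
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def naive_posterior(observed_route):
    """Infer values only, assuming the demonstrator shares the full construal."""
    return normalize({
        p: choice_likelihood(observed_route, p, "sees_lava") for p in LAVA_PENALTIES
    })


def joint_posterior(observed_route):
    """Infer values and construal together under uniform priors."""
    return normalize({
        (p, c): choice_likelihood(observed_route, p, c)
        for p, c in product(LAVA_PENALTIES, CONSTRUALS)
    })


def marginal_over_values(joint):
    """Sum the joint posterior over construals to get a posterior over values."""
    marginal = {p: 0.0 for p in LAVA_PENALTIES}
    for (p, _construal), prob in joint.items():
        marginal[p] += prob
    return marginal


naive = naive_posterior("A")
joint_values = marginal_over_values(joint_posterior("A"))
print("naive  P(dislikes lava | chose A):", round(naive[-10.0], 4))
print("joint  P(dislikes lava | chose A):", round(joint_values[-10.0], 4))
```

In this sketch the naive observer assigns well under 1% probability to the demonstrator disliking lava, while the joint observer keeps about one third, because the choice of route A is also explained by a construal that misses the lava. The numbers depend entirely on the assumed utilities and priors.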
Peer Reviews
Decision·Submitted to ICLR 2024
**Originality**: To the best of my knowledge, the idea of modeling construals (what the demonstrator knows or is constrained by) is new in IRL. **Clarity**: The text is crystal clear and very easy to follow. The examples are well constructed. All critical steps are properly formalized, and the notation is consistent. **Quality**: The related work section is very well done. The experimental setup also seems reasonable -- but I am not an expert, so I wouldn't be able to tell whether there are
**Clarity**: I am confused about the usage of the notion of "concept" in this context. In explainable AI, concepts refer to high-level representations (presumably interpretable) of a given input to be explained or otherwise processed. In cognitive science and logic it has a similar meaning. Here it is used as a synonym for knowledge or construal. I would appreciate it if the authors could clarify this in the introduction. **Quality**: [Q1] The single biggest issue with the paper is that it cons
The primary strength of the paper is in highlighting the importance of concept alignment in efforts to achieve human-AI alignment. This issue takes on far greater significance today than it did a few years ago, as AI agents trained with human feedback and demonstrations are now widely deployed in the real world. The work demonstrates that even in simple settings, failure to account for mental approximations can lead to catastrophic misalignment. The key takeaway is that we must be careful whe
My main concern is that there are conceptual barriers to applying the insights of this work to algorithms that scale to real-world problems. Bayesian IRL can be viewed as a "regularized" form of behavioral cloning, where the preference for policies that are optimal under high-probability reward functions improves generalization from limited amounts of data. The inference model proposed here retains this advantage because the space of reward functions and concepts is tightly constrained. Scale
The example is helpful in understanding the arguments made in the paper. The paper presents a useful subject-study that establishes the need for modelling human dynamics along with learning reward models in the context of IRL.
It seems that the work is pushing for “IRL agents attempting to learn from human trajectories should take into account human mental model of the task”. There is a considerable body of work that highlights the importance of modelling human mental models for behavior synthesis (as an example [1]) in automated planning. That is, taking into account human mental models (such as the transition function being used by them) for agent tasks like behavior synthesis, goal recognition, intention prediction
Taxonomy
Topics: Psychology of Moral and Emotional Judgment
Methods: ALIGN
