The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
Benjamin Minhao Chen, Xinyu Xie

TL;DR
This study examines how moral judgments differ across humans, AI systems, and the designers of those systems. It finds that making the human agency behind an AI visible changes how its behavior is evaluated, which complicates the task of aligning AI behavior with human values.
Contribution
It extends the alignment target problem by showing empirically that moral evaluations vary with the perceived agency and origin of AI actions, challenging the assumption that human behavior is the appropriate benchmark for AI decision-making.
Findings
Judgments of robots and humans are similar when actions are anonymous.
Judgments shift to rule-based reasoning when AI is described as human-designed.
People evaluate human designers and AI systems differently in moral scenarios.
Abstract
The project of aligning machine behavior with human values raises a basic problem: whose moral expectations should guide AI decision-making? Much alignment research assumes that the appropriate benchmark is how humans themselves would act in a given situation. Studies of agent-type value forks challenge this assumption by showing that people do not always judge humans and AI systems identically. This paper extends that challenge by examining two further possibilities: first, that evaluations of AI behavior change when its human origins are made visible; and second, that people judge the humans who program AI systems differently from either the machines or the human actors they are compared against. An experiment with 1,002 U.S. adults measured moral judgments in a runaway mine train scenario, varying the subject of evaluation across four conditions: a repairman, a repair robot, a repair…
