AI Alignment and Social Choice: Fundamental Limitations and Policy Implications
Abhilash Mishra

TL;DR
This paper examines the fundamental limitations of reinforcement learning with human feedback (RLHF) as a framework for AI alignment. Drawing on impossibility results from social choice theory, it argues that universal alignment that respects both democratic norms and individuals' private ethical preferences is impossible, and it discusses the policy implications.
Contribution
The paper shows that, under fairly broad assumptions, RLHF cannot achieve universal AI alignment because of impossibility results in social choice theory. This motivates transparent governance and narrowly scoped alignment strategies.
Findings
Under fairly broad assumptions, no unique voting protocol can universally align AI systems using RLHF through democratic processes.
Aligning AI agents with the values of all individuals violates some individuals' private ethical preferences.
Transparent voting rules are essential for accountable AI governance.
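The first and third findings can be made concrete with a small sketch. It is an illustration only, not the paper's own construction: the function names, the candidate responses "A", "B", "C", and the two preference profiles are all hypothetical. Each reinforcer is assumed to rank three candidate model outputs from most to least preferred. The sketch shows that pairwise majority voting can fail to produce any winner, and that two common voting rules can select different outputs from the same feedback, which is why the choice of rule needs to be stated openly.

```python
from collections import Counter

# Hypothetical setup: each tuple is one reinforcer's ranking of three
# candidate model outputs, most preferred first.

def plurality_winner(profile):
    """Output ranked first by the most reinforcers (ties broken alphabetically)."""
    counts = Counter(ranking[0] for ranking in profile)
    return max(sorted(counts), key=lambda c: counts[c])

def borda_winner(profile):
    """Output with the highest Borda score (ties broken alphabetically)."""
    scores = Counter()
    for ranking in profile:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position  # top rank earns n-1 points
    return max(sorted(scores), key=lambda c: scores[c])

def condorcet_winner(profile):
    """Output that beats every other one in pairwise majority votes, or None."""
    candidates = sorted(profile[0])
    for c in candidates:
        beats_all = all(
            2 * sum(r.index(c) < r.index(d) for r in profile) > len(profile)
            for d in candidates if d != c
        )
        if beats_all:
            return c
    return None

# Three reinforcers whose rankings form a majority cycle: A beats B,
# B beats C, and C beats A, so no pairwise-majority winner exists.
cycle = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

# Five reinforcers for whom plurality and Borda count disagree.
split = [("A", "B", "C")] * 3 + [("B", "C", "A")] * 2

print(condorcet_winner(cycle))
print(plurality_winner(split), borda_winner(split))
```

In the `split` profile, plurality counting and Borda counting pick different outputs from identical feedback, so which output the model is tuned toward depends on an aggregation rule that someone has to choose.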
Abstract
Aligning AI agents to human intentions and values is a key bottleneck in building safe and deployable AI applications. But whose values should AI agents be aligned with? Reinforcement learning with human feedback (RLHF) has emerged as the key framework for AI alignment. RLHF uses feedback from human reinforcers to fine-tune outputs; all widely deployed large language models (LLMs) use RLHF to align their outputs to human values. It is critical to understand the limitations of RLHF and consider policy challenges arising from these limitations. In this paper, we investigate a specific challenge in building RLHF systems that respect democratic norms. Building on impossibility results in social choice theory, we show that, under fairly broad assumptions, there is no unique voting protocol to universally align AI systems using RLHF through democratic processes. Further, we show that aligning…
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
Methods: Focus · ALIGN
