AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations
Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, Roel Dobbe

TL;DR
This paper critically examines the limitations and contradictions of using reinforcement learning from human or AI feedback to align AI systems with human values, highlighting sociotechnical challenges and ethical issues.
Contribution
It provides a multidisciplinary critique revealing theoretical and practical shortcomings of RLxF methods in capturing human ethics and safety.
Findings
RLxF struggles to fully encode human values and ethics.
Significant tensions exist between alignment goals like honesty and harmlessness.
Ethical trade-offs such as user-friendliness versus deception are often overlooked.
Abstract
This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLxF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics and contributing to AI safety. We highlight tensions and contradictions inherent in the goals of RLxF. In addition, we discuss ethically-relevant issues that tend to be neglected in discussions about alignment and RLxF, among which the…
Taxonomy
Topics: Cognitive Science and Mapping · Digital Transformation in Industry · Advanced Research in Systems and Signal Processing
Methods: ALIGN
