AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations

Adam Dahlgren Lindström; Leila Methnani; Lea Krause; Petter Ericson; Íñigo Martínez de Rituerto de Troya; Dimitri Coelho Mollo; Roel Dobbe

arXiv:2406.18346·cs.AI·June 27, 2024·2 cites

Open Access

TL;DR

This paper critically examines the limitations and contradictions of using reinforcement learning from human or AI feedback to align AI systems with human values, highlighting sociotechnical challenges and ethical issues.

Contribution

It provides a multidisciplinary critique revealing theoretical and practical shortcomings of RLxF methods in capturing human ethics and safety.

Findings

01. RLxF struggles to fully encode human values and ethics.

02. Significant tensions exist between alignment goals like honesty and harmlessness.

03. Ethical trade-offs such as user-friendliness versus deception are often overlooked.

Abstract

This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLxF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics and contributing to AI safety. We highlight tensions and contradictions inherent in the goals of RLxF. In addition, we discuss ethically-relevant issues that tend to be neglected in discussions about alignment and RLxF, among which the…

Taxonomy

Topics: Cognitive Science and Mapping · Digital Transformation in Industry · Advanced Research in Systems and Signal Processing

Methods: ALIGN