Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma
Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary

TL;DR
This paper formalizes the inherent trade-offs in reinforcement learning from human feedback (RLHF), showing that no RLHF system can simultaneously be representative of diverse human values, computationally tractable, and robust. At global scale, achieving representativeness and robustness together is computationally infeasible, and current methods compromise on representativeness.
Contribution
The paper introduces the Alignment Trilemma, a formal complexity-theoretic framework that explains the fundamental trade-offs in RLHF and analyzes why current approaches sacrifice representativeness.
Findings
Achieving both representativeness and robustness at global scale requires a super-polynomial number of operations.
Current RLHF methods collect limited samples from homogeneous pools, far below what's needed for true global representation.
The framework explains RLHF issues like bias amplification and preference collapse.
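To give a rough sense of what a super-polynomial requirement means in practice, the sketch below compares an operation count that grows as 2^d (the form of the lower bound quoted in the abstract, Omega(2^{d_context})) against a polynomial budget. The cubic budget, the helper names, and the sample values of d are assumptions chosen for illustration only. They are not taken from the paper, and the constants hidden by the Omega notation are ignored.

```python
def exponential_ops(d: int) -> int:
    """Operation count growing as 2**d (constants dropped; illustrative only)."""
    return 2 ** d


def polynomial_ops(d: int, degree: int = 3) -> int:
    """Hypothetical polynomial budget d**degree; the degree is an assumption."""
    return d ** degree


def crossover_dimension(degree: int = 3, limit: int = 200) -> int:
    """Smallest d in [1, limit] past which 2**d exceeds d**degree for every larger d checked."""
    last_within_budget = 0
    for d in range(1, limit + 1):
        if exponential_ops(d) <= polynomial_ops(d, degree):
            last_within_budget = d
    return last_within_budget + 1


if __name__ == "__main__":
    print("crossover for a cubic budget:", crossover_dimension(3))
    for d in (10, 20, 40, 80):
        ratio = exponential_ops(d) / polynomial_ops(d)
        print(f"d={d:>3}  2^d / d^3 = {ratio:.3e}")
```

Past the crossover point the gap widens with every additional dimension, which is the sense in which no fixed polynomial budget keeps up with an exponential lower bound.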
Abstract
Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations, which is…
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
