Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma

Subramanyam Sahoo; Aman Chadha; Vinija Jain; Divya Chaudhary

arXiv:2511.19504 · cs.LG · November 26, 2025

Open Access

TL;DR

This paper formalizes the inherent trade-offs in reinforcement learning from human feedback (RLHF) as a trilemma: no RLHF system can be representative of diverse human values, computationally tractable, and robust at the same time. At global scale, achieving representativeness and robustness together is computationally infeasible, and current methods respond by sacrificing representativeness.

Contribution

The paper introduces the Alignment Trilemma, a formal complexity-theoretic framework that explains the fundamental trade-offs in RLHF and analyzes why current approaches sacrifice representativeness.

Findings

01. Achieving both representativeness and robustness at global scale requires Omega(2^{d_context}) operations, which is exponential in d_context.

02. Current RLHF methods collect limited samples from homogeneous pools, far fewer than true global representation would need.

03. The framework explains known RLHF issues such as bias amplification and preference collapse.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations, which is…
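To give a feel for the Omega(2^{d_context}) bound quoted above, here is a minimal Python sketch comparing exponential growth with a polynomial compute budget. It is an illustration only, not the paper's analysis: the hidden constant in the Omega bound is assumed to be 1, and the cubic budget, the function names, and the sample values of d_context are all hypothetical.

```python
def exponential_lower_bound(d_context: int) -> int:
    """Operations implied by Omega(2^d_context), assuming a hidden constant of 1."""
    return 2 ** d_context


def polynomial_budget(d_context: int, degree: int = 3) -> int:
    """Hypothetical polynomial compute budget d_context^degree, for comparison only."""
    return d_context ** degree


def crossover_dimension(degree: int = 3, max_d: int = 10_000) -> int:
    """Smallest d_context >= 2 at which 2^d_context exceeds d_context^degree."""
    for d in range(2, max_d + 1):
        if exponential_lower_bound(d) > polynomial_budget(d, degree):
            return d
    raise ValueError("no crossover found below max_d")


if __name__ == "__main__":
    # Illustrative dimensions only; the paper's actual d_context values are not given here.
    for d in (10, 20, 40, 80):
        print(d, polynomial_budget(d), exponential_lower_bound(d))
    print("cubic budget overtaken at d_context =", crossover_dimension(3))
```

Under these assumptions, any fixed polynomial budget is eventually overtaken as d_context grows, which is the sense in which the bound rules out polynomial tractability.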

Taxonomy

Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning