The Specification Trap: Why Static Value Alignment Alone Is Insufficient for Robust Alignment
Austin Spizzirri

TL;DR
Static, content-based approaches to AI value alignment are insufficient for robust alignment of advanced AI systems: they face philosophical and structural problems that worsen as capabilities grow.
Contribution
The paper identifies core philosophical and structural limitations of fixed-value alignment methods and advocates for open, developmentally responsive approaches.
Findings
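Fixed-value alignment methods face three compounding philosophical difficulties: Hume's is-ought gap, Berlin's value pluralism, and the extended frame problem.
RLHF, Constitutional AI, inverse reinforcement learning, and cooperative assistance games all share structural failure modes that worsen with AI capability.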
Open, developmentally responsive alignment strategies are proposed as a potential solution.
Abstract
Static content-based AI value alignment is insufficient for robust alignment under capability scaling, distributional shift, and increasing autonomy. This holds for any approach that treats alignment as optimizing toward a fixed formal value-object, whether reward function, utility function, constitutional principles, or learned preference representation. Three philosophical results create compounding difficulties: Hume's is-ought gap (behavioral data underdetermines normative content), Berlin's value pluralism (human values resist consistent formalization), and the extended frame problem (any value encoding will misfit future contexts that advanced AI creates). RLHF, Constitutional AI, inverse reinforcement learning, and cooperative assistance games each instantiate this specification trap, and their failure modes reflect structural vulnerabilities, not merely engineering limitations…
