
TL;DR
This paper introduces a geometric framework that formally characterizes the alignment tax in AI systems. It analyzes safety-capability tradeoffs through subspace projections and derives the Pareto frontier that governs them.
Contribution
It provides a novel geometric theory of the alignment tax, including a formal definition, Pareto frontier derivation, and a scaling law decomposition under linear representation assumptions.
Findings
Derived the Pareto frontier governing safety-capability tradeoffs.
Proved the tightness and recursive structure of the Pareto frontier.
Decomposed the alignment tax into irreducible and residual components.
Abstract
The alignment tax is widely discussed but has not been formally characterized. We provide a geometric theory of the alignment tax in representation space. Under linear representation assumptions, we define the alignment tax rate as the squared projection of the safety direction onto the capability subspace and derive the Pareto frontier governing safety-capability tradeoffs, parameterized by a single quantity: the principal angle between the safety and capability subspaces. We prove this frontier is tight and show it has a recursive structure: safety-safety tradeoffs under capability constraints are governed by the same equation, with the angle replaced by the partial correlation between safety objectives given capability directions. We derive a scaling law decomposing the alignment tax into an irreducible component determined by data structure and a packing residual that vanishes as…
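The abstract's definition of the tax rate as a squared projection can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`alignment_tax_rate`, `orthonormal_basis`) are hypothetical, and it assumes the safety direction is normalized to unit length, so that the rate equals the squared cosine of the angle between that direction and the capability subspace.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Modified Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors that are (numerically) linearly dependent
            basis.append([wi / norm for wi in w])
    return basis


def alignment_tax_rate(safety_dir, capability_dirs):
    """Squared norm of the projection of the unit-normalized safety
    direction onto span(capability_dirs). The normalization is an
    assumption made for this sketch, not taken from the source."""
    norm_sq = dot(safety_dir, safety_dir)
    if norm_sq == 0:
        raise ValueError("safety direction must be nonzero")
    basis = orthonormal_basis(capability_dirs)
    proj_sq = sum(dot(safety_dir, b) ** 2 for b in basis)
    return proj_sq / norm_sq


# Toy example in R^3: capability subspace spanned by the first two axes.
capability = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(alignment_tax_rate((1.0, 0.0, 1.0), capability))  # 0.5
print(alignment_tax_rate((0.0, 0.0, 1.0), capability))  # 0.0
```

In this toy setting a safety direction at 45 degrees to the capability subspace gives a rate of 0.5, and a direction orthogonal to the subspace gives 0, the case in which the projection-based definition assigns no tax.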
Taxonomy
Topics: Formal Methods in Verification · Security and Verification in Computing · Adversarial Robustness in Machine Learning
