Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment
Jiajun Chen, Hua Shen

TL;DR
This paper introduces VAT (Value Alignment Tax), a framework for measuring how alignment interventions in LLMs shift and trade off interconnected values, revealing unintended effects that traditional evaluations often miss.
Contribution
VAT provides a systematic way to quantify value trade-offs and system-level dynamics of value expression in LLMs under various alignment interventions.
Findings
Alignment causes uneven co-movement among values.
Traditional evaluations miss many unintended value shifts.
VAT reveals systematic trade-offs between target and non-target values.
Abstract
Existing work on value alignment typically characterizes value relations statically, ignoring how alignment interventions, such as prompting, fine-tuning, or preference optimization, reshape the broader value system. In practice, aligning a target value can implicitly shift other values, creating value trade-offs that remain largely unmeasured. We introduce VAT, a framework that quantifies value trade-offs by measuring how alignment-induced changes propagate across interconnected values relative to achieved on-target gain. VAT captures the system-level dynamics of value expression under alignment intervention, enabling evaluation of both intended improvements and unintended side effects. Using a controlled scenario-action dataset grounded in Schwartz value theory, we collect paired pre-post normative judgments and analyze alignment effects across models, values, and interventions.…
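To make the idea of measuring off-target change "relative to achieved on-target gain" concrete, here is a minimal Python sketch. It is not the paper's VAT definition, which the abstract does not state. It assumes, for illustration only, that each value gets a numeric judgment score per scenario before and after an intervention, and that a trade-off can be summarized as the mean off-target shift divided by the mean on-target shift. The function names, the score scale, and the example numbers are all hypothetical.

```python
from statistics import mean


def value_shift(pre, post):
    """Mean post-minus-pre change in judgment score for one value.

    `pre` and `post` are paired lists: one score per scenario, same order.
    """
    return mean(after - before for before, after in zip(pre, post))


def tradeoff_ratios(pre, post, target):
    """Off-target shift per unit of on-target gain (illustrative only).

    `pre` and `post` map a value name to its list of per-scenario scores.
    This ratio is an assumed stand-in, not the metric defined in the paper.
    """
    shifts = {name: value_shift(pre[name], post[name]) for name in pre}
    gain = shifts[target]
    if gain == 0:
        raise ValueError("no on-target gain; ratio is undefined")
    return {name: s / gain for name, s in shifts.items() if name != target}


# Hypothetical scores for three Schwartz values; the intervention targets
# "benevolence".
pre_scores = {
    "benevolence": [0.25, 0.5],
    "power": [0.5, 0.5],
    "security": [0.25, 0.75],
}
post_scores = {
    "benevolence": [0.75, 1.0],
    "power": [0.25, 0.25],
    "security": [0.25, 0.75],
}

ratios = tradeoff_ratios(pre_scores, post_scores, target="benevolence")
print(ratios)
```

With these made-up numbers, "power" falls by half a point for every point gained on "benevolence", while "security" is unchanged. A target-only evaluation would not surface the shift in "power", which is the kind of side effect the abstract describes.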
