Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models
Seong Hah Cho, Junyi Li, Anna Leshinskaya

TL;DR
This paper investigates whether large language models differentiate between moral, grammatical, and economic value. It finds pervasive entanglement, in which moral value unduly influences the other two kinds, and shows that targeted activation ablation can mitigate it.
Contribution
It demonstrates that LLMs conflate different kinds of value and shows a method to reduce this entanglement through activation ablation.
Findings
LLMs show pervasive value entanglement among moral, grammatical, and economic values.
Moral value overly influences grammatical and economic valuations in LLMs.
Selective ablation of moral activation vectors reduces value conflation.
Abstract
Value alignment of Large Language Models (LLMs) requires us to empirically measure these models' actual, acquired representation of value. One characteristic of human value representation is that it distinguishes among different kinds of value. We investigate whether LLMs likewise distinguish three different kinds of good: moral, grammatical, and economic. By probing model behavior, embeddings, and residual stream activations, we report pervasive cases of value entanglement: a conflation between these distinct representations of value. Specifically, both grammatical and economic valuations were found to be overly influenced by moral value, relative to human norms. This conflation was repaired by selective ablation of the activation vectors associated with morality.
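To make the idea of ablating an activation vector concrete, here is a minimal sketch of directional ablation on synthetic data. It is not the authors' procedure: the abstract does not say how the morality-associated vectors are identified or at which layers they are removed. The difference-of-means estimate, the function names `difference_of_means` and `ablate_direction`, and the toy activations are all assumptions made for illustration.

```python
import numpy as np

def difference_of_means(pos, neg):
    # Assumed way to estimate a concept direction: mean activation on
    # concept-laden inputs minus mean activation on neutral inputs.
    return pos.mean(axis=0) - neg.mean(axis=0)

def ablate_direction(acts, direction):
    # Project each activation row onto the orthogonal complement of
    # `direction`, removing any component along it.
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)

# Synthetic stand-ins for residual stream activations (hypothetical data).
rng = np.random.default_rng(0)
hidden = 8
moral_axis = np.zeros(hidden)
moral_axis[0] = 1.0
moral_acts = rng.normal(size=(16, hidden)) + 3.0 * moral_axis
neutral_acts = rng.normal(size=(16, hidden))

direction = difference_of_means(moral_acts, neutral_acts)
ablated = ablate_direction(moral_acts, direction)

# After ablation, the projection onto the estimated direction is ~0.
residual = ablated @ (direction / np.linalg.norm(direction))
print(np.abs(residual).max())
```

In a real model, the same projection would be applied to hidden states during the forward pass, and the effect would be judged by whether grammatical and economic judgments become less sensitive to moral content.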
Taxonomy
Topics: Computational and Text Analysis Methods · Explainable Artificial Intelligence (XAI) · Sentiment Analysis and Opinion Mining
