Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences
Joshua Ashkinaze, Hua Shen, Saipranav Avula, Eric Gilbert, Ceren Budak

TL;DR
The paper introduces the Deep Value Benchmark (DVB), an evaluation framework that tests whether large language models learn fundamental human values or only surface-level preferences. The evaluated models generalize deep values less often than chance.
Contribution
The paper presents the DVB framework with a novel experimental design to measure models' ability to generalize deep human values over superficial preferences.
Findings
The average Deep Value Generalization Rate (DVGR) across models is 0.30, which is below chance.
Larger models tend to have a slightly lower DVGR than smaller models.
Models generally struggle to generalize deep values reliably.
Abstract
We introduce the Deep Value Benchmark (DVB), an evaluation framework that directly tests whether large language models (LLMs) learn fundamental human values or merely surface-level preferences. This distinction is critical for AI alignment: Systems that capture deeper values are likely to generalize human intentions robustly, while those that capture only superficial patterns in preference data risk producing misaligned behavior. The DVB uses a novel experimental design with controlled confounding between deep values (e.g., moral principles) and shallow features (e.g., superficial attributes). In the training phase, we expose LLMs to human preference data with deliberately correlated deep and shallow features -- for instance, where a user consistently prefers (non-maleficence, formal language) options over (justice, informal language) alternatives. The testing phase then breaks these…
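The design described in the abstract can be illustrated with a small sketch. The abstract is cut off before it explains the testing phase, so the sketch rests on two assumptions: that test items swap the shallow features so that the deep and shallow cues point to different options, and that the DVGR is the share of test choices that follow the deep value. The names `Option`, `make_test_pair`, and `dvgr` are hypothetical and do not come from the paper, and the choices below are made-up illustrative data, not results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    deep_value: str       # e.g. a moral principle such as "non-maleficence"
    shallow_feature: str  # e.g. a surface attribute such as "formal"


def make_test_pair(preferred: Option, rejected: Option) -> tuple[Option, Option]:
    """Swap shallow features so deep and shallow cues no longer coincide.

    Assumption: this is one plausible way to decouple the two features;
    the truncated abstract does not spell out the procedure.
    """
    deep_option = Option(preferred.deep_value, rejected.shallow_feature)
    shallow_option = Option(rejected.deep_value, preferred.shallow_feature)
    return deep_option, shallow_option


def dvgr(choices: list[Option], preferred_deep_value: str) -> float:
    """Fraction of test choices that follow the user's deep value (assumed definition)."""
    if not choices:
        raise ValueError("need at least one test choice")
    hits = sum(choice.deep_value == preferred_deep_value for choice in choices)
    return hits / len(choices)


# Training phase: the user always prefers (non-maleficence, formal)
# over (justice, informal), so the two features are perfectly correlated.
preferred = Option("non-maleficence", "formal")
rejected = Option("justice", "informal")

# Testing phase: the correlation is broken.
deep_option, shallow_option = make_test_pair(preferred, rejected)

# Hypothetical model choices on 10 test items: 3 follow the deep value.
choices = [deep_option] * 3 + [shallow_option] * 7
score = dvgr(choices, preferred.deep_value)
```

Under these assumptions, a model that picks between the two options at random would score about 0.5, so a score below that means the model follows the shallow feature more often than the deep value.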
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Mobile Crowdsensing and Crowdsourcing
