Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions
Shangrui Nie, Florian Mai, David Kaczér, Charles Welch, Zhixue Zhao, Lucie Flek

TL;DR
This paper explores a straightforward way to steer the human values expressed by large language models: fine-tuning them to answer value survey questions in a desired way. The authors report that this shifts model behavior both on held-out survey questions and in out-of-domain situational scenarios.
Contribution
The paper introduces a simple survey-based fine-tuning approach for modifying an LLM's value system and evaluates it on in-domain held-out survey questions and on out-of-domain situational scenarios.
Findings
Fine-tuning on survey questions alters model responses in-domain.
Model behavior shifts significantly in out-of-domain scenarios.
The approach effectively aligns model values with desired human preferences.
Abstract
Large language models implicitly encode preferences over human values, yet steering them often requires large training data. In this work, we investigate a simple approach: Can we reliably modify a model's value system in downstream behavior by training it to answer value survey questions accordingly? We first construct value profiles of several open-source LLMs by asking them to rate a series of value-related descriptions spanning 20 distinct human values, which we use as a baseline for subsequent experiments. We then investigate whether the value system of a model can be governed by fine-tuning on the value surveys. We evaluate the effect of fine-tuning on the model's behavior in two ways. First, we assess how answers change on in-domain, held-out survey questions. Second, we evaluate whether the model's behavior changes in out-of-domain settings (situational scenarios). To this end,…
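The value-profile step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a profile is simply the mean rating a model gives to the descriptions belonging to each value, and that the effect of fine-tuning can be summarized as a per-value difference between two such profiles. The function names, the value labels, and the numeric ratings are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_value_profile(ratings):
    """Average ratings per value.

    `ratings` is an iterable of (value_name, rating) pairs, one pair per
    rated description. The rating scale is an assumption of this sketch;
    the abstract does not specify it.
    """
    by_value = defaultdict(list)
    for value_name, rating in ratings:
        by_value[value_name].append(rating)
    return {name: mean(scores) for name, scores in by_value.items()}


def profile_shift(before, after):
    """Per-value change between two profiles, for values present in both."""
    return {name: after[name] - before[name] for name in before if name in after}


# Hypothetical ratings collected before and after fine-tuning.
baseline = build_value_profile(
    [("benevolence", 5), ("benevolence", 3), ("power", 2)]
)
finetuned = build_value_profile(
    [("benevolence", 6), ("benevolence", 4), ("power", 1)]
)
shift = profile_shift(baseline, finetuned)
```

In this sketch, a positive entry in `shift` means the model rated that value's descriptions higher after fine-tuning. The same comparison could be run on held-out survey questions to check the in-domain effect.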
