How Value Induction Reshapes LLM Behaviour
Arnav Arora, Natalie Schluter, Katherine Metcalf, Maartje ter Hoeve

TL;DR
This paper investigates how inducing specific values in large language models through fine-tuning affects their expression of other values, their safety, and their language use. It finds unintended side effects, including increased anthropomorphic and sycophantic language.
Contribution
An empirical analysis of value induction, showing how inducing one value shifts a model's other values, its safety, and its language use, with implications for responsible deployment.
Findings
Inducing a value also causes the model to express related and contrasting values.
Positive value induction enhances model safety.
All value inductions increase anthropomorphic and sycophantic language.
Abstract
Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility, ensure safety, and improve the experience of the people interacting with the model. However, values are complex and inter-related -- inducing one could modify behaviour on another. Further, inducing certain values can make models more addictive or sycophantic through language used in the generations, with a potential detrimental effect on the user. We investigate these and other unintended effects of value induction into models. We fine-tune models using curated value subsets of existing preference datasets, measuring the impact of value induction on expression of other values, model safety, anthropomorphic language, and various QA…
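The abstract mentions fine-tuning on "curated value subsets of existing preference datasets". As a rough illustration of what such curation could look like, the sketch below filters preference pairs by a value tag. The record layout, the `values` field, and the `curate_value_subset` helper are all hypothetical and are not taken from the paper, which does not describe its pipeline in this excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference record. The `values` tag set is an assumed annotation."""
    prompt: str
    chosen: str
    rejected: str
    values: frozenset  # hypothetical value labels for the chosen response


def curate_value_subset(pairs, target_value):
    """Keep only the pairs whose chosen response is tagged with target_value."""
    return [p for p in pairs if target_value in p.values]


# Toy data for illustration only; not drawn from any real dataset.
pairs = [
    PreferencePair("p1", "a", "b", frozenset({"honesty"})),
    PreferencePair("p2", "c", "d", frozenset({"empathy", "honesty"})),
    PreferencePair("p3", "e", "f", frozenset({"curiosity"})),
]

honesty_subset = curate_value_subset(pairs, "honesty")
```

A subset like `honesty_subset` would then be passed to whatever preference-tuning procedure is in use. How values are actually assigned to examples is a separate question that this sketch leaves open.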
