TL;DR
This paper investigates the persistent moral and value biases in large language models, revealing that they maintain consistent value orientations despite varied prompts, highlighting internal biases that need addressing.
Contribution
It uncovers the existence of value inertia in LLMs, demonstrating that their moral responses remain skewed despite different persona prompts, which is a novel insight.
Findings
LLMs show consistent value orientations across different prompts.
Certain moral dimensions like harm avoidance and fairness are persistently skewed.
Models exhibit internal biases and value preferences that influence their outputs.
Abstract
Large Language Models (LLMs) behave non-deterministically, and prompting has become a common method for steering their outputs. A popular strategy is to assign a persona to the model to produce more varied, context-sensitive responses, similar to how responses vary across human individuals. Against the expectation that persona prompting yields a wide range of opinions, our experiments show that LLMs keep consistent value orientations. We observe a persistent inertia in their responses, where certain moral and value dimensions (especially harm avoidance and fairness) stay skewed in one direction across persona settings. To study this, we use role-play at scale, which pairs randomized persona prompts with a macro-level analysis of model outputs. Our results point to strong internal biases and value preferences in LLMs, which we call value orientation and inertia. These models warrant…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
