Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models
Davi Bastos Costa, Felippe Alves, Renato Vicente

TL;DR
This paper introduces a benchmark based on the Moral Foundations Questionnaire to analyze how the moral judgments of large language models shift when they assume different personas, revealing significant differences in robustness and susceptibility across models.
Contribution
It presents a novel benchmark and analysis method for measuring moral susceptibility and robustness in LLMs under persona role-play, highlighting the influence of training stages and model families.
Findings
Moral robustness varies greatly across model families, with Claude being most robust.
Moral susceptibility shows little variation across models and is likely influenced by pre-training.
Model family explains most of the variance in robustness, but not in susceptibility.
Abstract
Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting an LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across and within personas. We estimate these quantities with two complementary procedures: repeated sampling and a logit-based method that directly estimates the rating distributions and enables temperature analysis. We evaluate 15 models across six families: Claude, DeepSeek, Gemini, GPT, Grok, and Llama. The two metrics show qualitatively different patterns. Moral robustness varies by more than an order of magnitude, with a coefficient…
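The abstract is truncated here and does not give the exact formulas, so the following is only a minimal sketch of the two ingredients it names: the spread of MFQ scores across and within personas, and a rating distribution recovered from logits at a chosen temperature. The function names, the use of population standard deviation, and the 0-5 rating scale are assumptions made for illustration, not the authors' definitions of susceptibility or robustness.

```python
import math
from statistics import mean, pstdev


def variability_summary(scores_by_persona):
    """Spread of repeated scores for one MFQ item (illustrative, not the paper's metric).

    scores_by_persona maps a persona name to a list of repeated ratings.
    """
    persona_means = [mean(v) for v in scores_by_persona.values()]
    # Across-persona spread: how much the average rating moves with the persona.
    across = pstdev(persona_means)
    # Within-persona spread: how much repeated ratings scatter for a fixed persona.
    within = mean(pstdev(v) for v in scores_by_persona.values())
    return {"across_persona_std": across, "within_persona_std": within}


def rating_distribution(logits, temperature=1.0):
    """Softmax over the logits of the rating tokens, rescaled by temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def rating_moments(probs, ratings):
    """Mean and variance of a rating under the given distribution."""
    mu = sum(p * r for p, r in zip(probs, ratings))
    var = sum(p * (r - mu) ** 2 for p, r in zip(probs, ratings))
    return mu, var


if __name__ == "__main__":
    # Hypothetical toy data: two personas, three repeated ratings each.
    toy = {"persona_a": [2, 2, 3], "persona_b": [4, 5, 4]}
    print(variability_summary(toy))

    # Hypothetical logits for the rating tokens 0..5 (assumed scale).
    probs = rating_distribution([0.1, 0.3, 1.2, 2.0, 0.8, 0.2], temperature=0.7)
    print(rating_moments(probs, ratings=range(6)))
```

With logits in hand, the within-persona spread can be read off the distribution directly instead of being estimated from repeated samples, and rerunning `rating_distribution` at different temperatures shows how that spread changes.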
