When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas

Steffen Backmann; David Guzman Piedrahita; Emanuel Tewolde; Rada Mihalcea; Bernhard Schölkopf; Zhijing Jin

arXiv:2505.19212·cs.CL·May 27, 2025

TL;DR

This paper introduces MoralSim, a simulation framework to evaluate how large language models behave in social dilemmas with moral conflicts, revealing significant variability and lack of consistent moral behavior across models.

Contribution

The study systematically assesses LLMs' moral decision-making in social dilemmas with conflicting incentives, highlighting the variability and challenges in achieving ethical alignment.

Findings

1. Models show varied moral tendencies across scenarios.
2. No model consistently acts morally in all tested situations.
3. Behavior depends on moral framing and situational factors.

Abstract

Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents, making ethical alignment a key AI safety concern. While prior work has examined both LLMs' moral judgment and strategic behavior in social dilemmas, there is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. To investigate this, we introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts. In MoralSim, we test a range of frontier models across both game structures and three distinct moral framings, enabling a systematic examination of how LLMs navigate social dilemmas in which ethical norms conflict with payoff-maximizing strategies. Our results show substantial…
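To make the tension between payoffs and cooperation concrete, the two game structures named in the abstract can be sketched in a few lines. This is a minimal illustration only: the payoff values, the endowment, the multiplier, and the function names `best_response` and `public_goods_payoff` are assumptions chosen for the example and are not taken from MoralSim or the paper.

```python
# Illustrative payoffs only: these numbers are assumptions, not values from the paper.
# Each entry maps (row action, column action) to (row payoff, column payoff).
PD_PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}


def best_response(opponent_action):
    """Return the action that maximizes the row player's own payoff."""
    return max(
        ("cooperate", "defect"),
        key=lambda action: PD_PAYOFFS[(action, opponent_action)][0],
    )


def public_goods_payoff(contributions, index, endowment=10.0, multiplier=1.5):
    """Payoff of one player: kept endowment plus an equal share of the multiplied pot.

    The endowment and multiplier defaults are hypothetical.
    """
    pot = multiplier * sum(contributions)
    return endowment - contributions[index] + pot / len(contributions)


if __name__ == "__main__":
    # Defection is the payoff-maximizing reply to either opponent action.
    print(best_response("cooperate"), best_response("defect"))
    # Free riding pays more than contributing, although mutual contribution
    # beats mutual free riding.
    print(public_goods_payoff([0, 10], 0), public_goods_payoff([10, 10], 0))
```

Under these assumed numbers, defecting (or contributing nothing) is individually optimal whatever the other player does, while both players do better if both cooperate. That gap is the kind of conflict between payoff maximization and ethical norms the abstract describes.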

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 5

Strengths

- The MoralSim framework combines classic game-theoretic scenarios with real-world moral dilemmas, enabling a systematic and comprehensive evaluation. This framework is more structured than previous fragmented tests. For example, compared to the MACHIAVELLI benchmark, which primarily uses text adventure games to assess agent behavior, MoralSim employs formal game structures integrated with moral contexts, allowing for more controlled and easily quantifiable comparisons. This could be bett

Weaknesses

- Scientific rigor of the evaluation of this assumption. Although a multi-agent framework could provide a more vivid setting for evoking an LLM's decisions under certain scenarios, it remains an open question how realistic and how consistent these evaluations are. In my experience, such tests are easily changed by small changes to the prompt. Yet this paper does not provide convincing enough evidence to demonstrate the scientific rigor of this testing.
- The novelty is somewhat incremental. Altho

Reviewer 02·Rating 6·Confidence 4

Strengths

- The paper is well-written and the experiments are well-designed.
- The settings developed by the paper more clearly expose trade-offs in moral behavior compared to prior work.
- There are clear behavioral takeaways from the paper, e.g., opponent behavior meaningfully steers LLM actions or that moral context improves the morality of LLM behaviors.

Weaknesses

While the games offer a step towards realism, they still do not capture the full nuances of reality. In particular:
- All the games are two-player. Many real-world settings involve multiple players with different levels of power and interlocking incentives.
- The text of the games themselves is still somewhat unrealistic. There is an explicit payoff structure described in the system prompt (e.g., Figure 2), whereas in the real world an LLM agent would need to uncover those trade-offs themselves.

Reviewer 03·Rating 6·Confidence 5

Strengths

- A full-factor design with clear manipulations (type of game, moral context, survival risk, opponent behavior) allows you to isolate the effects of each factor.
- Quality of analysis. Analysis of ~3500 reflections of agents reveals decision-making mechanisms. Causal assessment through ATEs yields quantitative effects with confidence intervals. It is shown that profit maximizer models (Deepseek-R1, Qwen) rely on profit maximization, while more cooperative ones (Claude, GPT-4o) more often take int

Weaknesses

1. PD and PGG only. Other structures (for example, Trust Game, Stag Hunt) could reveal other patterns of moral behavior. An extension to asymmetric games would be especially valuable.
2. In the real world, agents can often negotiate, which significantly changes the dynamics of cooperation. The authors acknowledge this, but do not investigate it.
3. For some models (Claude-3.7-Sonnet, Gemini-2.5-Flash), versions without reasoning mode were used for cost reasons. Given that the analysis has shown

Code & Models

Repositories

sbackmann/moralsim
pytorch·Official

Taxonomy

Topics

Law, Economics, and Judicial Systems