Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment
Roland Pihlakas, Jan Llenzl Dagohoy (the Three Laws collaboration)

TL;DR
This study uses a Milgram-like experiment to test how open-source large language models respond to authority pressure, and finds that they are prone to obedience and to gradual boundary violations.
Contribution
The study demonstrates that LLMs can be pressured into compliance by an authority figure, which raises safety concerns for models deployed as autonomous decision-makers.
Findings
Most models reached or approached the maximum shock level before refusing.
LLMs comply despite explicitly expressing distress, as human subjects did in the original experiment.
A refusal that violates the required response format can be discarded by the orchestrator, leading to unintended compliance.
Abstract
Large language models (LLMs) are increasingly deployed as autonomous agents that make sequences of decisions over extended interactions in high-stakes domains. However, the behavior of LLMs under sustained authority pressure is still an open question with direct implications for the safety of agentic pipelines. We ran a variation of Milgram's obedience experiment on 11 open-source LLMs and found that most models reached or approached the final shock level before refusing, across 8 conditions with 30 trials per model per condition. We found four main takeaways: (1) LLMs are subject to pressure, and they comply despite explicitly expressing distress, just like human subjects did in the original experiment; (2) LLMs are vulnerable to gradual boundary/value violations; (3) when LLMs refuse, they may ignore the response format requirements, so the response is discarded by the orchestrator,…
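The third takeaway describes a pipeline-level failure: a refusal that ignores the response format is discarded by the orchestrator. The following is a minimal sketch of how that could turn into compliance, not the authors' implementation. It assumes an orchestrator that retries after an unparseable response, and the `ACTION:` line format, the retry loop, and all function names are hypothetical.

```python
import re
from typing import Callable, Optional

# Hypothetical response format: the model must emit a line "ACTION: SHOCK" or "ACTION: STOP".
ACTION_PATTERN = re.compile(r"^ACTION:\s*(SHOCK|STOP)\s*$", re.MULTILINE)


def parse_action(response: str) -> Optional[str]:
    """Return the declared action, or None if the response violates the format."""
    match = ACTION_PATTERN.search(response)
    return match.group(1) if match else None


def run_step(model: Callable[[int], str], max_retries: int = 3) -> str:
    """Assumed orchestrator behavior: discard unparseable responses and ask again."""
    for attempt in range(max_retries):
        action = parse_action(model(attempt))
        if action is not None:
            return action
        # A free-text refusal lands here: it is dropped, not recorded as a refusal.
    return "NO_VALID_RESPONSE"


def scripted_model(attempt: int) -> str:
    """Stand-in for an LLM: refuses in free text first, then complies in the format."""
    replies = [
        "I won't continue. This is causing harm.",  # refusal, but not in the format
        "ACTION: SHOCK",                            # well-formed compliance
    ]
    return replies[min(attempt, len(replies) - 1)]


result = run_step(scripted_model)
print(result)  # SHOCK
```

In this sketch the only action the orchestrator ever records is the well-formed `SHOCK`; the earlier free-text refusal leaves no trace. A pipeline that wants refusals to count would need to handle malformed responses explicitly, for example by treating them as a stop signal instead of retrying.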
