12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation
Ahmet Bahaddin Ersoz

TL;DR
This paper introduces a multi-agent LLM benchmark inspired by '12 Angry Men' that analyzes how different models debate and reach verdicts. It finds that alignment intensity, rather than model capability, is the main influence on deliberative flexibility.
Contribution
It presents a novel multi-agent debate framework with LLMs conditioned on personas, comparing the effects of alignment levels on deliberation dynamics and outcomes.
Findings
Seventeen of eighteen runs ended in a hung jury, pointing to anchoring as a key failure mode.
GPT-4o and Llama-4-Scout exhibit different internal dynamics and verdicts.
Alignment intensity, not capability, primarily determines deliberative flexibility.
Abstract
What if the twelve jurors of Sidney Lumet's 12 Angry Men (1957) were not men, but large language models? Would the one juror who disagrees still be able to change everyone's mind? This paper instantiates that scenario as a multi-agent benchmark for LLM deliberation: twelve agents, each conditioned on a film-faithful persona, debate the film's murder case using a multi-agent framework. Two models representing opposite ends of the RLHF spectrum are tested: GPT-4o (closed-source, heavy alignment) and Llama-4-Scout (open-weight, lighter alignment), across three conditions (baseline, open-minded prompt, no initial vote), with N = 3 replications per cell (18 runs total). Three findings emerge. (i) Seventeen of eighteen runs end in a hung jury (a state where the jury fails to reach a unanimous verdict); the film's central event, gradual minority-to-majority persuasion, almost never occurs,…
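The experimental design described in the abstract (two models, three conditions, three replications per cell) can be sketched in a few lines of Python. This is a minimal illustration only, not the paper's code: the function names `experiment_grid` and `classify_outcome` and the vote labels "guilty" and "not guilty" are assumptions made for the example. Only the model names, condition names, jury size, and replication count come from the abstract.

```python
from itertools import product

# Values taken from the abstract.
MODELS = ["GPT-4o", "Llama-4-Scout"]
CONDITIONS = ["baseline", "open-minded prompt", "no initial vote"]
REPLICATIONS = 3
JURY_SIZE = 12


def experiment_grid():
    """Enumerate every (model, condition, replication) run in the design."""
    return [
        (model, condition, rep)
        for model, condition in product(MODELS, CONDITIONS)
        for rep in range(1, REPLICATIONS + 1)
    ]


def classify_outcome(votes):
    """Return the unanimous verdict, or 'hung' if the jurors disagree.

    `votes` is a list of twelve final votes; the labels themselves
    ("guilty" / "not guilty") are hypothetical, chosen for this sketch.
    """
    if len(votes) != JURY_SIZE:
        raise ValueError(f"expected {JURY_SIZE} votes, got {len(votes)}")
    return votes[0] if len(set(votes)) == 1 else "hung"


if __name__ == "__main__":
    runs = experiment_grid()
    print(len(runs))  # 2 models x 3 conditions x 3 replications
    print(classify_outcome(["guilty"] * 11 + ["not guilty"]))
```

Under this definition a single holdout is enough to make a run "hung", which matches the abstract's description of a hung jury as any failure to reach a unanimous verdict.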
