# Legal Decision Biases in GPT: A Comparison with Human Judgment

**Authors:** Toscane F. Bessis, Andy J. Wills, Bartosz W. Wojciechowski, Lee C. White, Emmanuel M. Pothos

PMC · DOI: 10.3390/bs16030437 · Behavioral Sciences · 2026-03-17

## TL;DR

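This paper compares procedural biases in legal judgments made by GPT-4o, GPT-5.2, and humans. It finds that the models, like humans, are swayed by normatively irrelevant factors such as evidence order, and that prompt engineering does not reliably remove these effects.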

## Contribution

The study shows that GPT-4o and, to a lesser degree, GPT-5.2 exhibit procedural biases in legal judgments that resemble those documented in humans, although the models' evaluation bias differs in structure from the human pattern.

## Key findings

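- GPT-4o showed robust order effects and a form of evaluation bias; the latter differed in structure from the human pattern.
- GPT-5.2 showed similar but attenuated effects.
- Prompt engineering based on current best-practice recommendations had limited and inconsistent impact on these biases in both GPT-4o and GPT-5.2.
- Procedural factors such as evidence order and intermediate judgments influence model decisions, as they do those of human legal professionals.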

## Abstract

Legal decision-making is expected to meet high standards of consistency and rationality, yet human judgments in this domain are known to be influenced by procedural factors such as evidence order and intermediate evaluations. Recent work has shown that even legal professionals, including judges, are susceptible to such biases when assessing criminal cases. This raises a critical question: do large language models, which are increasingly proposed as decision-support tools in legal contexts, exhibit similar procedural biases—and if so, can these biases be mitigated? To address this question, we tested GPT-4o and GPT-5.2 using a controlled legal judgment task adapted from prior human research. The task involved simplified criminal cases in which we systematically manipulated (i) the order of incriminating and exonerating evidence and (ii) whether an intermediate guilt judgment was required before a final decision. Model responses were directly compared to human judgments from the original study. We additionally examined whether prompt engineering strategies, based on current best-practice recommendations, could reduce observed biases. GPT-4o exhibited robust order effects and a form of evaluation bias, although the latter differed in structure from the human pattern. GPT-5.2 showed similar but attenuated effects. Across both models, prompt engineering had limited and inconsistent impact, failing to reliably eliminate procedural sensitivity. These findings suggest that even advanced large language models remain vulnerable to normatively irrelevant procedural influences. More broadly, they advise caution in treating large language models as inherently rational or bias-resistant decision-support systems in high-stakes professional domains such as law.
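The two manipulations described in the abstract form a 2 × 2 design: evidence order crossed with the presence of an intermediate judgment. The Python sketch below is one minimal way to enumerate those conditions. It is not the authors' code. The evidence sentences, the function names (`build_steps`, `all_conditions`, `order_effect`), the placement of the intermediate judgment between the two evidence blocks, and the use of a difference in mean ratings as an order-effect measure are all assumptions made for illustration.

```python
from itertools import product
from statistics import mean

# Hypothetical placeholder evidence; the paper's actual case materials
# are not reproduced in this summary.
INCRIMINATING = ["A witness places the defendant near the scene."]
EXONERATING = ["The defendant's employer confirms an alibi."]

ORDERS = ["incriminating_first", "exonerating_first"]


def build_steps(order, intermediate):
    """Return the ordered list of (kind, content) steps for one condition.

    order: "incriminating_first" or "exonerating_first".
    intermediate: if True, request a guilt judgment between the two
        evidence blocks (assumed placement) as well as at the end.
    """
    blocks = {"incriminating": INCRIMINATING, "exonerating": EXONERATING}
    if order == "incriminating_first":
        first, second = "incriminating", "exonerating"
    else:
        first, second = "exonerating", "incriminating"

    steps = [("evidence", item) for item in blocks[first]]
    if intermediate:
        steps.append(("judgment", "intermediate"))
    steps += [("evidence", item) for item in blocks[second]]
    steps.append(("judgment", "final"))
    return steps


def all_conditions():
    """Enumerate the four cells of the order x intermediate-judgment design."""
    return {
        (order, intermediate): build_steps(order, intermediate)
        for order, intermediate in product(ORDERS, [False, True])
    }


def order_effect(final_ratings_a, final_ratings_b):
    """Difference in mean final guilt ratings between two evidence orders.

    An assumed, simplified measure; the paper's own analysis may differ.
    """
    return mean(final_ratings_a) - mean(final_ratings_b)
```

If the decision were procedurally invariant, the final judgment would not depend on which cell a case falls into, so `order_effect` would be close to zero for ratings collected under the two orders.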

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13024114/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13024114/full.md

## References

20 references — full list in the complete paper: https://tomesphere.com/paper/PMC13024114/full.md

---
Source: https://tomesphere.com/paper/PMC13024114