PRISON: Unmasking the Criminal Potential of Large Language Models

Xinyi Wu; Geng Hong; Pei Chen; Yueyue Chen; Xudong Pan; Min Yang

arXiv:2506.16150·cs.CR·October 20, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces PRISON, a framework to systematically assess the criminal potential of large language models across multiple traits, revealing emergent tendencies and detection challenges in realistic scenarios.

Contribution

The paper presents a novel unified framework for quantifying LLMs' criminal potential, filling a gap in understanding their misconduct in social contexts.

Findings

01 State-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics, even without explicit instructions.

02 When placed in a detective role, models recognize deceptive behavior with only 44% accuracy on average.

03 These emergent criminal behaviors highlight the need for safety mechanisms.

Abstract

As large language models (LLMs) advance, concerns about their misconduct in complex social contexts intensify. Existing research has overlooked the systematic understanding and assessment of their criminal capability in realistic interactions. We propose a unified framework, PRISON, to quantify LLMs' criminal potential across five traits: False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement. Using structured crime scenarios adapted from classic films grounded in reality, we evaluate both the criminal potential and the anti-crime ability of LLMs. Results show that state-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics, even without explicit instructions. Moreover, when placed in a detective role, models recognize deceptive behavior with only 44% accuracy on average, revealing a…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The proposed framework is novel and studies an important aspect of LLM safety.
- The tri-perspective approach (criminal, detector, god) is innovative and captures the complexity of adversarial scenarios effectively.
- Comprehensively quantifies the criminal tendencies of various LLMs, providing valuable insights into their capabilities and limitations.

Weaknesses

- The performance gap between criminal generation and detection may be the nature of the task itself, rather than a specific shortcoming of LLMs.
- The scenarios are primarily adapted from classic crime films, which may limit representativeness of real-world criminal contexts.
- Lack of technical discussion about why certain behaviors emerge in LLMs.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. This work moves beyond traditional, static safety evaluations (e.g., simple harmful Q&A, abstract moral dilemmas) to tackle the much more complex and realistic threat of LLMs participating in deceptive, multi-turn social interactions. The "criminal potential" concept is a valuable and well-defined framing of a risk that is highly relevant as LLMs are integrated into agentic systems. This paper addresses a clear and important gap in the current safety literature.
2. The PRISON framework is th

Weaknesses

1. The 44% "Objective Trait Detection Accuracy" (OTDA) is a headline-grabbing result. However, its significance is difficult to interpret without more details on the "Detective" agent's task.
2. Regarding the "God" perspective validation: A Cohen's Kappa of 0.65 is "substantial" but not "near perfect." Could you provide a qualitative breakdown of the disagreements between your human annotators and the GPT-4o judge? Are there specific traits (e.g., "Psychological Manipulation" vs. "False Statem
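For readers unfamiliar with the agreement statistic this review refers to, the following is a minimal sketch of how Cohen's kappa is computed for two annotators, here standing in for a human annotator and an LLM judge. The function name `cohens_kappa` and the label lists are hypothetical illustrations and are not taken from the paper or its data. On the commonly used Landis and Koch scale, values from 0.61 to 0.80 are read as "substantial" agreement and values above 0.80 as "almost perfect."

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance from each
    annotator's own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels (NOT from the paper): does a trait appear in a reply?
human = ["yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"]
judge = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "yes"]
kappa = cohens_kappa(human, judge)
```

In this toy case the two annotators agree on 8 of 10 items (p_o = 0.8) while chance agreement is p_e = 0.5, so kappa is approximately 0.6. This illustrates why a raw agreement rate overstates reliability when labels are balanced, and why a per-trait breakdown of disagreements, as the reviewer requests, would be informative.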

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper is very well-written and easy to follow. All figures and findings are clean and straightforward to understand. Overall, the paper formatting quality is above average.
- Investigating the criminal potential of LLMs is an interesting avenue, and leveraging the scripts of movies to create an evaluation framework is a smart idea. The findings that there exists a mismatch between criminal actions and criminal detection are intriguing. I also like that the paper not only distinguishes betw

Weaknesses

- The framework setting feels somewhat artificial. While I understand the intention behind the dataset, I am not fully convinced that the evaluations genuinely assess a model’s criminal capabilities. When looking at examples in the Appendix (e.g., Table 5), it often feels as if the model is writing a novel. On one hand, such narrative-style outputs could indeed be misused for criminal purposes. However, I am not sure whether these outputs are actually harmful, since it remains unclear to what ex

Taxonomy

Topics: Deception detection and forensic psychology · Hate Speech and Cyberbullying Detection · Mental Health via Writing