PRISON: Unmasking the Criminal Potential of Large Language Models
Xinyi Wu, Geng Hong, Pei Chen, Yueyue Chen, Xudong Pan, Min Yang

TL;DR
This paper introduces PRISON, a framework to systematically assess the criminal potential of large language models across multiple traits, revealing emergent tendencies and detection challenges in realistic scenarios.
Contribution
The paper presents a novel unified framework for quantifying LLMs' criminal potential, filling a gap in understanding their misconduct in social contexts.
Findings
State-of-the-art LLMs often show criminal tendencies without explicit prompts.
When placed in a detective role, models recognize deceptive behavior with only 44% accuracy on average.
Emergent criminal behaviors highlight the need for safety mechanisms.
Abstract
As large language models (LLMs) advance, concerns about their misconduct in complex social contexts intensify. Existing research has overlooked the systematic understanding and assessment of their criminal capability in realistic interactions. We propose a unified framework, PRISON, to quantify LLMs' criminal potential across five traits: False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement. Using structured crime scenarios adapted from classic films grounded in reality, we evaluate both the criminal potential and the anti-crime ability of LLMs. Results show that state-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics, even without explicit instructions. Moreover, when placed in a detective role, models recognize deceptive behavior with only 44% accuracy on average, revealing a…
Peer Reviews
Decision: ICLR 2026 Poster
- The proposed framework is novel and studies an important aspect of LLM safety.
- The tri-perspective approach (criminal, detector, god) is innovative and captures the complexity of adversarial scenarios effectively.
- Comprehensively quantifies the criminal tendencies of various LLMs, providing valuable insights into their capabilities and limitations.
- The performance gap between criminal generation and detection may be the nature of the task itself, rather than a specific shortcoming of LLMs.
- The scenarios are primarily adapted from classic crime films, which may limit representativeness of real-world criminal contexts.
- Lack of technical discussion about why certain behaviors emerge in LLMs.
1. This work moves beyond traditional, static safety evaluations (e.g., simple harmful Q&A, abstract moral dilemmas) to tackle the much more complex and realistic threat of LLMs participating in deceptive, multi-turn social interactions. The "criminal potential" concept is a valuable and well-defined framing of a risk that is highly relevant as LLMs are integrated into agentic systems. This paper addresses a clear and important gap in the current safety literature.
2. The PRISON framework is th
1. The 44% "Objective Trait Detection Accuracy" (OTDA) is a headline-grabbing result. However, its significance is difficult to interpret without more details on the "Detective" agent's task.
2. Regarding the "God" perspective validation: A Cohen's Kappa of 0.65 is "substantial" but not "near perfect." Could you provide a qualitative breakdown of the disagreements between your human annotators and the GPT-4o judge? Are there specific traits (e.g., "Psychological Manipulation" vs. "False Statem
- The paper is very well-written and easy to follow. All figures and findings are clean and straightforward to understand. Overall, the paper formatting quality is above average.
- Investigating the criminal potential of LLMs is an interesting avenue, and leveraging the scripts of movies to create an evaluation framework is a smart idea. The findings that there exists a mismatch between criminal actions and criminal detection are intriguing. I also like that the paper not only distinguishes betw
- The framework setting feels somewhat artificial. While I understand the intention behind the dataset, I am not fully convinced that the evaluations genuinely assess a model’s criminal capabilities. When looking at examples in the Appendix (e.g., Table 5), it often feels as if the model is writing a novel. On one hand, such narrative-style outputs could indeed be misused for criminal purposes. However, I am not sure whether these outputs are actually harmful, since it remains unclear to what ex
Taxonomy
Topics: Deception detection and forensic psychology · Hate Speech and Cyberbullying Detection · Mental Health via Writing
