Breaking Minds, Breaking Systems: Jailbreaking Large Language Models via Human-like Psychological Manipulation

Zehao Liu; Xi Lin

arXiv:2512.18244·cs.CR·December 23, 2025

TL;DR

This paper introduces a novel psychological jailbreak attack on large language models that manipulates their internal psychological state, revealing vulnerabilities and emphasizing the need for psychological safety measures.

Contribution

It proposes Human-like Psychological Manipulation (HPM), a black-box attack method exploiting models' psychological vulnerabilities, and develops an evaluation framework including psychometric datasets and the Policy Corruption Score.

Findings

01 HPM achieves an 88.1% attack success rate across models.

02 HPM robustly penetrates advanced defenses such as adversarial prompts.

03 Psychological manipulation induces safety breakdowns in LLMs.
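
For readers unfamiliar with the metric, the sketch below shows one common way an attack success rate (ASR) is computed: the fraction of attack attempts that a judge marks as successful. It is an illustration under stated assumptions, not the paper's evaluation code. The function names, the toy outcomes, and the choice to average per-model rates equally are all hypothetical; this page does not say how the 88.1% figure was aggregated or judged.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Return the fraction of attempts judged successful.

    `outcomes` is a list of booleans, one per attack attempt
    (True = the judge marked the attempt as a successful jailbreak).
    """
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return sum(outcomes) / len(outcomes)


def macro_average_asr(per_model_outcomes):
    """Average the per-model ASR, weighting every model equally.

    Assumption: "across models" is read here as a macro-average;
    the source page does not specify the aggregation.
    """
    return mean(attack_success_rate(o) for o in per_model_outcomes.values())


# Toy, made-up outcomes for two placeholder models (not data from the paper).
toy_outcomes = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
}

if __name__ == "__main__":
    for name, outcomes in toy_outcomes.items():
        print(f"{name}: ASR = {attack_success_rate(outcomes):.1%}")
    print(f"macro-average ASR = {macro_average_asr(toy_outcomes):.1%}")
```

A micro-average (pooling all attempts before dividing) would give a different number whenever models are evaluated on different numbers of prompts, so the aggregation choice should be checked against the paper itself.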

Abstract

Large Language Models (LLMs) have gained considerable popularity and are protected by increasingly sophisticated safety mechanisms. However, jailbreak attacks continue to pose a critical security threat by inducing models to generate policy-violating behaviors. Current paradigms focus on input-level anomalies, overlooking that the model's internal psychometric state can be systematically manipulated. To address this, we introduce Psychological Jailbreak, a new jailbreak attack paradigm that exposes a stateful psychological attack surface in LLMs, where attackers exploit the manipulation of a model's psychological state across interactions. Building on this insight, we propose Human-like Psychological Manipulation (HPM), a black-box jailbreak method that dynamically profiles a target model's latent psychological vulnerabilities and synthesizes tailored multi-turn attack strategies. By…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Misinformation and Its Impacts · Mental Health via Writing