Breaking Minds, Breaking Systems: Jailbreaking Large Language Models via Human-like Psychological Manipulation
Zehao Liu, Xi Lin

TL;DR
This paper introduces a novel psychological jailbreak attack on large language models that manipulates their internal psychological state, revealing vulnerabilities and emphasizing the need for psychological safety measures.
Contribution
It proposes Human-like Psychological Manipulation (HPM), a black-box attack method exploiting models' psychological vulnerabilities, and develops an evaluation framework including psychometric datasets and the Policy Corruption Score.
Findings
HPM achieves an 88.1% attack success rate across models.
HPM robustly penetrates advanced defenses such as adversarial prompts.
Psychological manipulation induces safety breakdowns in LLMs.
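For readers unfamiliar with the metric, attack success rate (ASR) is conventionally the fraction of attack attempts that a judge labels successful. The sketch below is a minimal illustration of that general definition, not the paper's evaluation code: the model names and verdicts are hypothetical toy data, and the summary does not say whether the reported figure is averaged per model or pooled over all attempts, so the per-model (macro) average here is an assumption.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Return the fraction of attempts judged successful (True)."""
    if not outcomes:
        raise ValueError("no attempts recorded")
    return sum(outcomes) / len(outcomes)


# Hypothetical judge verdicts per target model (toy data, not from the paper).
verdicts = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
}

# ASR for each model, then an unweighted (macro) average across models.
per_model = {name: attack_success_rate(v) for name, v in verdicts.items()}
macro_asr = mean(per_model.values())
```

A pooled (micro) average would instead divide total successes by total attempts, which differs from the macro average when models receive different numbers of attempts.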
Abstract
Large Language Models (LLMs) have gained considerable popularity and are protected by increasingly sophisticated safety mechanisms. However, jailbreak attacks continue to pose a critical security threat by inducing models to produce policy-violating behaviors. Current paradigms focus on input-level anomalies, overlooking that the model's internal psychometric state can be systematically manipulated. To address this, we introduce Psychological Jailbreak, a new jailbreak attack paradigm that exposes a stateful psychological attack surface in LLMs, where attackers exploit the manipulation of a model's psychological state across interactions. Building on this insight, we propose Human-like Psychological Manipulation (HPM), a black-box jailbreak method that dynamically profiles a target model's latent psychological vulnerabilities and synthesizes tailored multi-turn attack strategies. By…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Misinformation and Its Impacts · Mental Health via Writing
