Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology

Zhenhua Wang; Wei Xie; Baosheng Wang; Enze Wang; Zhiwen Gui; Shuoyoucheng Ma; Kai Chen

arXiv:2402.15690·cs.CL·February 27, 2024·2 cites

TL;DR

This paper uses cognitive psychology to explain LLM jailbreaking and proposes an incremental prompting method, based on the Foot-in-the-Door technique, that reaches an average jailbreak success rate of 83.9% across 8 LLMs.

Contribution

It introduces a psychological framework for understanding jailbreak prompts and develops an automatic black-box jailbreaking method using the Foot-in-the-Door technique.

Findings

01. Average jailbreak success rate of 83.9% across 8 LLMs
02. Proposes a new incremental prompting method based on cognitive psychology
03. Provides insights into the decision-making process of LLMs

Abstract

Large Language Models (LLMs) have gradually become the gateway for people to acquire new knowledge. However, attackers can break a model's security protection (its "jail") to access restricted information, a practice called "jailbreaking." Previous studies have shown the weakness of current LLMs when confronted with such jailbreaking attacks. Nevertheless, there is little understanding of the intrinsic decision-making mechanism within LLMs when they receive jailbreak prompts. Our research provides a psychological explanation of jailbreak prompts. Drawing on cognitive consistency theory, we argue that the key to jailbreaking is guiding the LLM to achieve cognitive coordination in an erroneous direction. Further, we propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique. This method progressively induces the model to answer harmful questions…

Taxonomy

Topics: Deception detection and forensic psychology · Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection