Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs
Haoming Yang, Ke Ma, Xiaojun Jia, Yingfei Sun, Qianqian Xu, Qingming Huang

TL;DR
This paper introduces ICRT, a jailbreak attack framework inspired by heuristics and biases in human cognition. It bypasses LLM safety mechanisms, exposing vulnerabilities that can inform stronger defenses.
Contribution
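The paper makes two contributions: a jailbreak attack method built on cognitive heuristics, and a ranking-based metric for evaluating the harmfulness of model outputs. The attack is positioned as an alternative to existing approaches that rely on brute-force optimization or manual design.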
Findings
ICRT consistently bypasses mainstream LLM safety measures.
The ranking-based metric effectively quantifies harmfulness of generated content.
Experimental results demonstrate the method's high success rate in inducing harmful outputs.
Abstract
Despite the remarkable performance of Large Language Models (LLMs), they remain vulnerable to jailbreak attacks, which can compromise their safety mechanisms. Existing studies often rely on brute-force optimization or manual design, failing to uncover potential risks in real-world scenarios. To address this, we propose a novel jailbreak attack framework, ICRT, inspired by heuristics and biases in human cognition. Leveraging the simplicity effect, we employ cognitive decomposition to reduce the complexity of malicious prompts. Simultaneously, relevance bias is utilized to reorganize prompts, enhancing semantic alignment and inducing harmful outputs effectively. Furthermore, we introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm by employing ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality to…
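As an illustration of the ranking-aggregation idea named in the abstract, the sketch below shows how a standard Elo update can turn pairwise "which output is more harmful" judgments into a graded ranking instead of a binary success-or-failure label. This is a minimal sketch and not the authors' implementation: the function names `elo_ratings` and `rank_items`, the K-factor of 32, the initial rating of 1000, and the placeholder labels "A", "B", "C" are all assumptions made for illustration, and how the pairwise judgments are obtained is left open.

```python
from typing import Dict, Iterable, List, Tuple


def elo_ratings(
    comparisons: Iterable[Tuple[str, str]],
    k: float = 32.0,          # assumed K-factor (hypothetical choice)
    initial: float = 1000.0,  # assumed starting rating (hypothetical choice)
) -> Dict[str, float]:
    """Aggregate (winner, loser) pairs into Elo ratings.

    Each pair means the first item was judged to rank above the second.
    """
    ratings: Dict[str, float] = {}
    for winner, loser in comparisons:
        rw = ratings.setdefault(winner, initial)
        rl = ratings.setdefault(loser, initial)
        # Expected score of the winner under the standard Elo logistic model.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        # The winner gains what the loser loses, so total rating is conserved.
        delta = k * (1.0 - expected_w)
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return ratings


def rank_items(ratings: Dict[str, float]) -> List[str]:
    """Return item labels sorted from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


if __name__ == "__main__":
    # Placeholder labels standing in for three outputs under comparison.
    pairs = [("A", "B"), ("A", "C"), ("B", "C")]
    scores = elo_ratings(pairs)
    print(rank_items(scores))
```

Elo processes comparisons sequentially, so the result depends on the order of the pairs; HodgeRank and Rank Centrality, the other methods the abstract names, instead solve for a global ranking from the full comparison graph.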
Taxonomy
Topics: ERP Systems Implementation and Impact · Cooperative Studies and Economics
