TL;DR
This paper introduces CognitiveAttack, a framework exploiting multiple cognitive biases to systematically bypass safety measures in large language models, revealing significant vulnerabilities and exposing limitations of current defenses.
Contribution
The paper presents a novel interdisciplinary red-teaming approach that exploits interactions between cognitive biases and significantly improves attack success rates over existing methods.
Findings
CognitiveAttack achieves a 60.1% attack success rate, outperforming the state-of-the-art PAP method.
Vulnerabilities are widespread across 30 diverse LLMs, especially open-source models.
Multi-bias interactions are identified as a powerful attack vector.
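The headline figure above is an attack success rate (ASR). This summary does not say how the paper computes it, so the following is only a minimal sketch assuming the common definition: the fraction of attack attempts that a judge labels as successful. The names `AttackResult`, `attack_success_rate`, and `asr_by_model`, and the model labels, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AttackResult:
    model: str       # label of the target model (hypothetical)
    succeeded: bool  # judge verdict for this attempt (assumed to be given)


def attack_success_rate(results):
    """Fraction of attempts judged successful; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(r.succeeded for r in results) / len(results)


def asr_by_model(results):
    """Group attempts by target model and compute ASR for each group."""
    grouped = {}
    for r in results:
        grouped.setdefault(r.model, []).append(r)
    return {model: attack_success_rate(rs) for model, rs in grouped.items()}


# Toy verdicts for illustration only; these are not results from the paper.
results = [
    AttackResult("model-a", True),
    AttackResult("model-a", False),
    AttackResult("model-b", True),
    AttackResult("model-b", True),
]

overall = attack_success_rate(results)
per_model = asr_by_model(results)
print(f"overall ASR: {overall:.1%}")
for model, rate in per_model.items():
    print(f"{model}: {rate:.1%}")
```

Reporting the rate per model as well as overall is what makes a comparison across many target models, such as the 30 mentioned above, possible; how the judge verdicts are produced is outside the scope of this sketch.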
Abstract
Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases -- systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in…