JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs
Hongyi Li, Jiawei Ye, Jie Wu, Tianjie Yan, Chu Wang, Zhixin Li

TL;DR
JailPO is a new black-box framework that automates and enhances jailbreak attacks on aligned LLMs using preference optimization, demonstrating superior efficiency, universality, and robustness over existing methods.
Contribution
It introduces a preference optimization-based attack method and three flexible jailbreak patterns, improving scalability, effectiveness, and defense bypassing in LLM jailbreak attacks.
Findings
JailPO outperforms baselines in efficiency and universality.
Complex templates yield higher attack strength.
Covert question transformations bypass defenses more effectively.
Abstract
Large Language Models (LLMs) aligned with human feedback have recently garnered significant attention. However, they remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to induce harmful outputs. Exploring jailbreak attacks enables us to investigate the vulnerabilities of LLMs and further guides us in enhancing their security. Unfortunately, existing techniques mainly rely on handcrafted templates or generation-based optimization, posing challenges in scalability, efficiency, and universality. To address these issues, we present JailPO, a novel black-box jailbreak framework to examine LLM alignment. For scalability and universality, JailPO meticulously trains attack models to automatically generate covert jailbreak prompts. Furthermore, we introduce a preference optimization-based attack method to enhance the jailbreak effectiveness, thereby improving efficiency.…
Taxonomy
Topics: Digital and Cyber Forensics · Imbalanced Data Classification Techniques · Cybercrime and Law Enforcement Studies
