JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs

Hongyi Li; Jiawei Ye; Jie Wu; Tianjie Yan; Chu Wang; Zhixin Li

arXiv:2412.15623·cs.CR·December 23, 2024

TL;DR

JailPO is a new black-box framework that automates and enhances jailbreak attacks on aligned LLMs using preference optimization, demonstrating superior efficiency, universality, and robustness over existing methods.

Contribution

It introduces a preference optimization-based attack method and three flexible jailbreak patterns, improving scalability, effectiveness, and defense bypassing in LLM jailbreak attacks.

Findings

01 JailPO outperforms baselines in efficiency and universality.

02 Complex templates yield higher attack strength.

03 Covert question transformations bypass defenses more effectively.

Abstract

Large Language Models (LLMs) aligned with human feedback have recently garnered significant attention. However, they remain vulnerable to jailbreak attacks, in which adversaries manipulate prompts to induce harmful outputs. Exploring jailbreak attacks enables us to investigate the vulnerabilities of LLMs and guides us in enhancing their security. Unfortunately, existing techniques mainly rely on handcrafted templates or generation-based optimization, which poses challenges in scalability, efficiency, and universality. To address these issues, we present JailPO, a novel black-box jailbreak framework to examine LLM alignment. For scalability and universality, JailPO trains attack models to automatically generate covert jailbreak prompts. Furthermore, we introduce a preference optimization-based attack method to enhance jailbreak effectiveness, thereby improving efficiency.…

Taxonomy

Topics: Digital and Cyber Forensics · Imbalanced Data Classification Techniques · Cybercrime and Law Enforcement Studies