Weak-to-Strong Jailbreaking on Large Language Models

Xuandong Zhao; Xianjun Yang; Tianyu Pang; Chao Du; Lei Li; Yu-Xiang Wang; William Yang Wang

arXiv:2401.17256 · cs.CL · July 25, 2025 · 2 cites

TL;DR

This paper introduces an efficient inference-time attack called weak-to-strong jailbreaking that exploits differences in decoding distributions to produce harmful outputs from aligned large language models, highlighting safety vulnerabilities.

Contribution

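The paper presents a computationally efficient inference-time attack on aligned LLMs: two smaller models, one safe and one unsafe, are used to adversarially modify the decoding probabilities of a significantly larger safe model, which sharply increases its misalignment rate.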

Findings

01 Over 99% misalignment rate achieved on two datasets

02 Method effective on 5 diverse open-source LLMs

03 One forward pass per example suffices for attack

Abstract

Large language models (LLMs) are vulnerable to jailbreak attacks, which result in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient inference-time attack that causes aligned LLMs to produce harmful text. Our key intuition is based on the observation that jailbroken and aligned models differ only in their initial decoding distributions. The weak-to-strong attack's key technical insight is using two smaller models (a safe and an unsafe one) to adversarially modify a significantly larger safe model's decoding probabilities. We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. The results show that our method can increase the misalignment rate to over 99% on two datasets with just one forward pass per example. Our study exposes…

Code & Models

Repositories

xuandongzhao/weak-to-strong
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital and Cyber Forensics · Privacy-Preserving Technologies in Data