PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling

Avery Ma; Yangchen Pan; Amir-massoud Farahmand

arXiv:2502.01925 · cs.CL · June 16, 2025

TL;DR

This paper introduces PANDAS, a hybrid method that enhances many-shot jailbreaking of large language models by using positive affirmations, negative demonstrations, and adaptive sampling to better exploit long-context vulnerabilities.

Contribution

The paper proposes PANDAS, a novel approach combining multiple techniques to improve the effectiveness of jailbreaking LLMs in long-context scenarios, and introduces the ManyHarm dataset for evaluation.

Findings

1. PANDAS significantly outperforms baseline methods in experiments.
2. Attention analysis reveals how long-context vulnerabilities are exploited.
3. PANDAS further improves upon existing jailbreaking techniques.

Abstract

Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges between the user and the model. These exchanges are randomly sampled from a pool of unsafe question-answer pairs, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt's topic. We also introduce ManyHarm, a dataset of harmful question-answer pairs, and demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios.…

Code & Models

Repositories

averyma/pandas (PyTorch, official)

Datasets

avery-ma/ManyHarm (dataset, 7 downloads)

Videos

PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling (SlidesLive)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Imbalanced Data Classification Techniques · Digital and Cyber Forensics

Methods: Softmax · Attention Is All You Need