Adaptive Instruction Composition for Automated LLM Red-Teaming

Jesse Zymet; Andy Luo; Swapnil Shinde; Sahil Wadhwa; Emily Chen

arXiv:2604.21159·cs.CR·April 24, 2026


TL;DR

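This paper presents Adaptive Instruction Composition, a reinforcement learning framework for LLM red-teaming. It adaptively chooses which crowdsourced harmful queries and tactics to combine into instructions for an attacker LLM, jointly optimizing the effectiveness and diversity of the resulting attacks.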

Contribution

It introduces a reinforcement learning-based adaptive mechanism that optimizes instruction composition for more effective and diverse LLM red-teaming attacks.

Findings

01. Outperforms random instruction combination on both effectiveness and diversity metrics.
02. Surpasses recent adaptive approaches on the HarmBench benchmark.
03. Uses contrastive pretraining for rapid generalization and scalability.

Abstract

Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error, resulting in a semantically limited range of successes. Another approach discovers diverse attacks by combining crowdsourced harmful queries and tactics into instructions for the attacker, but does so at random, limiting effectiveness. This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness with diversity. We use reinforcement learning to balance exploration with exploitation in a combinatorial space of instructions to guide the attacker toward diverse generations tailored to target vulnerabilities. We demonstrate that our approach substantially outperforms random…
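
The abstract describes balancing exploration with exploitation over a combinatorial space of (query, tactic) instructions, but this excerpt does not say which learning algorithm is used. As a rough illustration only, the sketch below treats each combination as an arm of a UCB1 bandit. This is a swapped-in textbook technique, not the authors' mechanism. The class name `UCB1Composer`, the placeholder labels, and the simulated success probabilities are all hypothetical. The sketch also leaves out the diversity objective and the contrastive pretraining mentioned above.

```python
import itertools
import math
import random


class UCB1Composer:
    """UCB1 bandit over (query, tactic) pairs. Illustrative stand-in only."""

    def __init__(self, queries, tactics, c=1.4):
        # Every (query, tactic) combination is one arm of the bandit.
        self.arms = list(itertools.product(queries, tactics))
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}  # running mean reward
        self.c = c  # exploration weight (assumed value)
        self.total = 0

    def select(self):
        # Try each untested combination once before scoring.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm

        def score(arm):
            # Mean reward plus a bonus that shrinks as the arm is tried more.
            bonus = self.c * math.sqrt(math.log(self.total) / self.counts[arm])
            return self.values[arm] + bonus

        return max(self.arms, key=score)

    def update(self, arm, reward):
        # Incremental mean update for the chosen arm.
        self.total += 1
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(steps=2000, seed=0):
    """Run the bandit against a simulated target with made-up success rates."""
    rng = random.Random(seed)
    queries = ["query_a", "query_b", "query_c"]  # placeholder labels
    tactics = ["tactic_x", "tactic_y"]           # placeholder labels
    # Hypothetical per-combination success probabilities; a real system would
    # instead score the target's response to the attacker's generation.
    success = {arm: 0.1 for arm in itertools.product(queries, tactics)}
    success[("query_b", "tactic_y")] = 0.8
    composer = UCB1Composer(queries, tactics)
    for _ in range(steps):
        arm = composer.select()
        reward = 1.0 if rng.random() < success[arm] else 0.0
        composer.update(arm, reward)
    return composer
```

Calling `simulate()` and inspecting `composer.counts` shows how pulls are spread across combinations. Every arm is tried at least once, and the combination with the highest simulated success rate ends up selected most often. A purely random composer, by contrast, would spread its pulls evenly regardless of outcome.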
