Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

Zhang Wei; Hanxuan Chen; Peilu Hu; Zhenyuan Wei; Chenwei Liang; Jing Luo; Ziyi Ni; Hao Yan; Li Mei; Shengning Lang; Kuan Lu; Xi Xiao; Zhimo Han; Yijin Wang; Yichao Zhang; Chen Yang; Junfeng Hao; Jiayi Gu; Riyang Bao; Mu-Jiang-Shan Wang

arXiv:2512.20677·cs.CR·April 29, 2026

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

Zhang Wei, Hanxuan Chen, Peilu Hu, Zhenyuan Wei, Chenwei Liang, Jing Luo, Ziyi Ni, Hao Yan, Li Mei, Shengning Lang, Kuan Lu, Xi Xiao, Zhimo Han, Yijin Wang, Yichao Zhang, Chen Yang, Junfeng Hao, Jiayi Gu, Riyang Bao, Mu-Jiang-Shan Wang

PDF

TL;DR

This paper introduces a learning-based automated red-teaming framework for large language models, significantly improving vulnerability discovery efficiency and coverage over manual methods.

Contribution

It formulates automated adversarial testing as a structured search problem and combines meta-prompt-guided generation with hierarchical detection for scalable robustness evaluation.

Findings

01

Identified 47 vulnerabilities, including 21 high-severity failures.

02

Achieved 3.9 times higher vulnerability discovery rate than manual red-teaming.

03

Demonstrated superior coverage, efficiency, and reproducibility in robustness evaluation.

Abstract

The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices are largely manual and expert-driven, which limits scalability, reproducibility, and coverage in high-dimensional prompt spaces. We formulate automated LLM red-teaming as a structured adversarial search problem and propose a learning-driven framework for scalable vulnerability discovery. The approach combines meta-prompt-guided adversarial prompt generation with a hierarchical execution and detection pipeline, enabling standardized evaluation across six representative threat categories, including reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Extensive experiments on GPT-OSS-20B identify…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.