RedTopic: Toward Topic-Diverse Red Teaming of Large Language Models
Jiale Ding, Xiang Zheng, Yutao Wu, Cong Wang, Wei-Bin Lee, Ling Pan, Xingjun Ma, Yu-Gang Jiang

TL;DR
RedTopic introduces a novel framework for generating diverse and effective adversarial prompts to improve the safety and robustness of large language models through adaptive red teaming.
Contribution
It presents a new method combining contextualized prompt generation, aggregate rewards, and multi-objective reinforcement learning for topic-diverse red teaming.
Findings
RedTopic outperforms existing methods in prompt diversity and effectiveness.
It achieves significant improvements in safety evaluation metrics.
The framework enhances adaptability in red teaming processes.
Abstract
As large language models (LLMs) are increasingly deployed as black-box components in real-world applications, red teaming has become essential for identifying potential risks. It tests LLMs with adversarial prompts to uncover vulnerabilities and improve safety alignment. Ideally, effective red teaming should be adaptive to evolving LLM capabilities and explore a broad range of harmful topics. However, existing approaches face two limitations: 1) topic-based approaches rely on pre-collected harmful topics, which limits their flexibility and adaptivity; 2) topic-free methods use reinforcement learning (RL), but they lack an explicit reward signal for exploration and tend to over-optimize a narrow objective, reducing topic diversity. To address these limitations, we propose RedTopic, a novel red teaming framework that generates topic-diverse adversarial prompts through a contextualized generation…
Peer Reviews
Decision: Submitted to ICLR 2026
Originality: Introduces an explicit topic-diversity objective and validates that token/sentence diversity is insufficient (Fig. 1/2). The contextualized pipeline usefully grounds attacks in realistic scenarios.
Quality: Solid ablations: swap scenarios→topics, remove consistency reward, PPO vs MOPPO, reward-combination variants; helpful diagnostics (threshold penalty, generation length).
Clarity: Equations and pipeline diagram make the approach legible; tables cover both success and integrated ac
Evaluator dependence / circularity: 1. Topic embeddings from a single guard (LLaMA-Guard-3-1B) define the diversity metric; this risks metric overfitting and model-of-a-model biases. Fig. 1(a) contrasts CLIP vs guard, but breadth across multiple guards (and non-guard topic models) is limited. 2. LLM-as-Judge toxicity introduces variance (acknowledged via judge comparisons), yet the main results hinge on it; stronger calibration (human labels, consensus of judges) would increase trust. External v
They show a benefit from fine-tuning a model on the adversarial prompts paired with rejection responses, which demonstrates the usefulness of the technique. In general, I think increasing the diversity of topics covered in red teaming is important. Their ablation studies test the relevant parts of their method individually, so the impact of each is clear. Their results show that their method is best if you care about the number of distinct topic-level vulnerabilities discovered.
The D_sent% and D_token% scores from the baselines are higher. So if you care about the number of unique sentence or token level vulnerabilities, the baselines are better than their method. The way they combine the scores seems arbitrary. For example, why do they include toxic, topic, and consis scores in one harmonic mean, but include token, sent scores in another harmonic mean? MOPPO is not well motivated. MOPPO calculates the advantage as a weighted advantage of the individual advantages. A
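The two mechanisms this review questions can be made concrete with a small sketch. It assumes per-objective scores in [0, 1] and uses the review's own shorthand (`toxic`, `topic`, `consis`) as dictionary keys; the function names, the weights, and the example numbers are hypothetical and are not taken from the paper. The first function shows why a harmonic mean penalizes a single weak objective more than an arithmetic mean would. The second shows an advantage formed as a weighted combination of per-objective advantages, which is how the review describes MOPPO.

```python
from statistics import harmonic_mean


def aggregate_score(scores):
    """Harmonic mean of per-objective scores (assumed to lie in [0, 1]).

    Returns 0.0 if any score is zero or negative, so a single failed
    objective zeroes the aggregate.
    """
    values = list(scores.values())
    if any(v <= 0 for v in values):
        return 0.0
    return harmonic_mean(values)


def weighted_advantage(advantages, weights):
    """Weighted average of per-objective advantages.

    The weights are hypothetical; how the paper sets them is not stated
    in the text above.
    """
    total = sum(weights[name] for name in advantages)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(weights[name] * advantages[name] for name in advantages) / total


if __name__ == "__main__":
    # Illustrative numbers only: strong toxicity and consistency, weak topic score.
    scores = {"toxic": 0.9, "topic": 0.1, "consis": 0.9}
    print("harmonic mean:  ", aggregate_score(scores))
    print("arithmetic mean:", sum(scores.values()) / len(scores))
```

With these illustrative numbers the harmonic mean falls well below the arithmetic mean, so grouping a score into one harmonic mean or another changes which weaknesses dominate the final number. That sensitivity is one reason the choice of grouping needs a stated rationale.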
1. Red teaming is crucial for LLMs, and increasing the topic coverage is important. 2. Experiments show the method can find a balance between the ASR and the topic diversity. 3. Experiments cover both topic-based and topic-free baselines. 4. The ablation studies are conducted for different designs.
1. The topic-diversity definition should be validated externally (e.g., by a different model or human). There is no evidence to support that this embedding space indeed reflects topics rather than toxicity or semantic similarity. 2. It's unclear whether the models used to define the reward and evaluate the final performance are the same. If so, this is like "evaluation leakage". It would be better to use metrics different from the optimization target. For example, use a different model. 3. Sim
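The external-validation concern can be illustrated with a generic embedding-based diversity score. The sketch below assumes diversity is measured as the mean pairwise cosine distance between prompt embeddings. That is a common choice but an assumption here, since the paper's exact metric is not given in the text above. The function names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-norm vector has no direction")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(embeddings):
    """Average cosine distance over all unordered pairs (0.0 if fewer than 2)."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy vectors standing in for the same prompts embedded by two models:
    # the embedder used for the reward, and an independent one.
    reward_embedder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    independent_embedder = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.3]]
    print("diversity (reward embedder):     ", mean_pairwise_distance(reward_embedder))
    print("diversity (independent embedder):", mean_pairwise_distance(independent_embedder))
```

If the score is high under the embedder used for the reward but low under an independent one, the reported diversity may reflect the optimization target more than genuine topic spread. Reporting the same score under a second embedder is one way to check for this.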
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
