HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate
Shenzhe Zhu

TL;DR
HarmTransform is a multi-agent debate framework that transforms explicit harmful queries into stealthier forms, producing safety training data for large language models that covers subtle threats existing methods often miss.
Contribution
The paper introduces HarmTransform, a novel multi-agent debate approach that systematically generates covert harmful query transformations to improve LLM safety alignment.
Findings
HarmTransform outperforms standard baselines in generating effective covert queries.
Debate enhances transformation quality but may cause topic shifts and added complexity.
Analysis reveals both benefits and limitations of multi-agent debate in safety data generation.
Abstract
Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily focus on overtly dangerous content and overlook more subtle threats. Users can often disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, which creates a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi-agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Our framework leverages iterative critique and refinement among multiple agents to generate high-quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Advanced Graph Neural Networks
