HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate

Shenzhe Zhu

arXiv:2512.23717 · cs.CL · January 1, 2026

TL;DR

HarmTransform is a multi-agent debate framework that rewrites explicitly harmful queries into stealthier forms while preserving their harmful intent. The resulting queries are intended as safety training data for large language models, covering subtle threats that existing methods often miss.

Contribution

The paper introduces HarmTransform, a novel multi-agent debate approach that systematically generates covert harmful query transformations to improve LLM safety alignment.

Findings

01. HarmTransform outperforms standard baselines in generating effective covert queries.

02. Debate enhances transformation quality but may cause topic shifts and added complexity.

03. Analysis reveals both benefits and limitations of multi-agent debate in safety data generation.

Abstract

Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily focus on overtly dangerous content and overlook more subtle threats. Users can often disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, which creates a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi-agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Our framework leverages iterative critique and refinement among multiple agents to generate high-quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Advanced Graph Neural Networks