What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
Chenyu Zhang

TL;DR
This paper introduces AHV-D&S, a training-free method for detecting and suppressing risky content in Diffusion Transformer-based text-to-image models by analyzing attention head sensitivities.
Contribution
It proposes a novel inference-time safeguard that leverages attention head vectors to identify and mitigate risky content without retraining the model.
Findings
Effectively suppresses sexual, copyrighted, and harmful content.
Maintains high visual quality of generated images.
Robust against adversarial prompts and transferable across models.
Abstract
The rise of text-to-image (T2I) models has increasingly raised concerns regarding the generation of risky content, such as sexual, violent, and copyright-protected images, highlighting the need for effective safeguards within the models themselves. Although existing methods have been proposed to eliminate risky concepts from T2I models, they are primarily developed for earlier U-Net architectures, leaving the state-of-the-art Diffusion-Transformer-based T2I models inadequately protected. This gap stems from a fundamental architectural shift: Diffusion Transformers (DiTs) entangle semantic injection and visual synthesis via joint attention, which makes it difficult to isolate and erase risky content within the generation. To bridge this gap, we investigate how semantic concepts are represented in DiTs and discover that attention heads exhibit concept-specific sensitivity. This property…
