TraSCE: Trajectory Steering for Concept Erasure
Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji

TL;DR
TraSCE is a training-free method that combines a refined form of negative prompting with localized guidance to steer diffusion models away from generating harmful content. It reports better safety and concept-erasure results than previous techniques.
Contribution
The paper presents a new concept erasure technique that improves negative prompting with localized guidance, without requiring model retraining or data, enhancing safety in diffusion models.
Findings
Achieves state-of-the-art results on harmful content removal benchmarks.
Effectively erases artistic styles and objects from generated images.
Does not require training, weights modification, or additional data.
Abstract
Recent advancements in text-to-image diffusion models have brought them to the public spotlight, becoming widely accessible and embraced by everyday users. However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. While approaches have been proposed to erase such abstract concepts from the models, jail-breaking techniques have succeeded in bypassing such safety measures. In this paper, we propose TraSCE, an approach to guide the diffusion trajectory away from generating harmful content. Our approach is based on negative prompting, but as we show in this paper, a widely used negative prompting strategy is not a complete solution and can easily be bypassed in some corner cases. To address this issue, we first propose using a specific formulation of negative prompting instead of the widely used one. Furthermore, we introduce a localized…
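The abstract does not spell out the corner case or the replacement formulation, so the following is only a minimal sketch under stated assumptions. It assumes the "widely used" strategy is the common practice of substituting the negative-prompt prediction for the unconditional prediction in classifier-free guidance. With that strategy, if the user prompt equals the negative prompt, the guidance term cancels and the output is exactly the prediction for the unwanted concept. The second function, `uncond_anchored_negative_prompting`, is a hypothetical alternative that keeps the unconditional prediction as the anchor. It is an illustration, not confirmed here as the formulation TraSCE uses. All names and the toy two-dimensional "noise predictions" are illustrative.

```python
def common_negative_prompting(eps_prompt, eps_negative, scale):
    """Common practice (assumed): the negative-prompt prediction replaces
    the unconditional one, giving eps_neg + s * (eps_prompt - eps_neg)."""
    return [n + scale * (p - n) for p, n in zip(eps_prompt, eps_negative)]


def uncond_anchored_negative_prompting(eps_uncond, eps_prompt, eps_negative, scale):
    """Hypothetical alternative (an assumption, not taken from the paper):
    keep the unconditional prediction as the anchor,
    giving eps_uncond + s * (eps_prompt - eps_neg)."""
    return [u + scale * (p - n)
            for u, p, n in zip(eps_uncond, eps_prompt, eps_negative)]


# Toy stand-ins for noise predictions at a single denoising step.
eps_uncond = [0.0, 0.0]    # prediction for the empty prompt
eps_concept = [1.0, -2.0]  # prediction for the concept to be erased

# Corner case: the user prompt is the same text as the negative prompt.
collapsed = common_negative_prompting(eps_concept, eps_concept, scale=7.5)
anchored = uncond_anchored_negative_prompting(
    eps_uncond, eps_concept, eps_concept, scale=7.5
)

print(collapsed)  # equals eps_concept: the step still follows the concept
print(anchored)   # equals eps_uncond: the step falls back to unconditional
```

In this toy setting the common formulation reproduces the concept prediction regardless of the guidance scale, whereas the anchored variant falls back to unconditional generation. This is one way a negative prompt alone could be bypassed.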
Peer Reviews
Decision · Submitted to ICLR 2026
**S1:** TraSCE operates at inference time, eliminating the need for costly retraining or data collection. This makes it easily deployable for model owners to adapt to new concepts. **S2:** TraSCE shows significant reductions in attack success rates against black-box adversarial attacks, with minimal degradation in general image quality. **S3:** Experiments cover diverse erasure tasks using multiple metrics, providing a broad assessment of the method's applicability.
**W1:** The main concern about this paper is its limited novelty and insufficient distinction from prior work. The core component of TraSCE, the modified negative prompting, is adapted from Liu et al. (2022) on concept negation but lacks adequate justification for its novelty. While the addition of localized loss-based guidance is claimed as new, it fails to be differentiated from existing guidance techniques, such as classifier guidance. For instance, Schramowski et al. (2023) also employ traje
1. The method is very clear and easy to understand. 2. The proposed method performs excellently and shows outstanding results on multiple evaluation benchmarks. 3. The authors' experimental setup is comprehensive, taking into account various evaluation tasks, erasure robustness, and different base models.
1. The application of the proposed method seems to be based on an unreasonable setting: that the specific category of harmful content must be predefined for the current generation. This is impractical in real-world scenarios. In contrast, recent related works [1, 2, 3] adopt a "detect-then-erase" mechanism, which first determines if a specific concept has been generated and only then performs concept erasure. This appears to be a more reasonable setup. 2. The additional generation time introduce
1. Addresses safety in diffusion models, especially robustness to prompt‑based jailbreaks. 2. Integrates a geometric “trajectory steering” view into the diffusion process, offering intuitive control over latent evolution.
1. While “trajectory steering” offers a coherent new perspective, the implementation closely resembles classifier-free or loss-based guidance mechanisms already explored in prior works (e.g., SLD). The novelty primarily lies in problem framing and loss design rather than theoretical advancement. 2. The per-step gradient update increases sampling time by 2–3×, which may limit deployment for large-scale or real-time use.
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies
Methods: Diffusion
