TraSCE: Trajectory Steering for Concept Erasure

Anubhav Jain; Yuya Kobayashi; Takashi Shibuya; Yuhta Takida; Nasir Memon; Julian Togelius; Yuki Mitsufuji

arXiv:2412.07658 · cs.CV · March 18, 2025

TL;DR

TraSCE introduces a novel, training-free method using refined negative prompting and localized guidance to effectively steer diffusion models away from generating harmful content, surpassing previous techniques in safety and concept erasure.

Contribution

The paper presents a new concept erasure technique that improves negative prompting with localized guidance, without requiring model retraining or data, enhancing safety in diffusion models.

Findings

1. Achieves state-of-the-art results on harmful content removal benchmarks.
2. Effectively erases artistic styles and objects from generated images.
3. Does not require training, weight modification, or additional data.

Abstract

Recent advancements in text-to-image diffusion models have brought them to the public spotlight, becoming widely accessible and embraced by everyday users. However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. While approaches have been proposed to erase such abstract concepts from the models, jail-breaking techniques have succeeded in bypassing such safety measures. In this paper, we propose TraSCE, an approach to guide the diffusion trajectory away from generating harmful content. Our approach is based on negative prompting, but as we show in this paper, a widely used negative prompting strategy is not a complete solution and can easily be bypassed in some corner cases. To address this issue, we first propose using a specific formulation of negative prompting instead of the widely used one. Furthermore, we introduce a localized…
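The abstract contrasts a widely used negative prompting strategy with a different formulation but does not state either formula. As a rough illustration only, the sketch below assumes two commonly discussed forms: one where the negative-prompt prediction replaces the unconditional branch of classifier-free guidance, and one in the style of concept negation (Liu et al., 2022, cited by Reviewer 01) that keeps the unconditional prediction as the base. The function names, the toy vectors standing in for noise predictions, and the choice of corner case (the prompt conditioning coinciding with the negative prompt) are all assumptions made for this example, not details confirmed by the paper.

```python
import numpy as np


def widely_used_negative_prompt(eps_prompt, eps_neg, scale):
    """Assumed 'widely used' form: the negative-prompt prediction
    replaces the unconditional branch of classifier-free guidance."""
    return eps_neg + scale * (eps_prompt - eps_neg)


def uncond_anchored_negative_prompt(eps_uncond, eps_prompt, eps_neg, scale):
    """Assumed alternative form: keep the unconditional prediction as the
    base and steer along the direction from negative prompt to prompt."""
    return eps_uncond + scale * (eps_prompt - eps_neg)


# Toy stand-ins for noise predictions (hypothetical values, not model outputs).
rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=4)
eps_concept = rng.normal(size=4)  # prediction conditioned on the concept to erase

# Corner case: the user prompt yields the same prediction as the negative prompt.
a = widely_used_negative_prompt(eps_concept, eps_concept, scale=7.5)
b = uncond_anchored_negative_prompt(eps_uncond, eps_concept, eps_concept, scale=7.5)

# Under these assumptions the first form returns the concept prediction itself,
# while the second falls back to the unconditional prediction.
collapses_to_concept = np.allclose(a, eps_concept)
falls_back_to_uncond = np.allclose(b, eps_uncond)
```

In this toy setting the guidance term vanishes in both forms, so the only difference is the base prediction that remains: the concept-conditioned one in the first form and the unconditional one in the second.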

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

**S1:** TraSCE operates at inference time, eliminating the need for costly retraining or data collection. This makes it easily deployable for model owners to adapt to new concepts.

**S2:** TraSCE shows significant reductions in attack success rates against black-box adversarial attacks, with minimal degradation in general image quality.

**S3:** Experiments cover diverse erasure tasks using multiple metrics, providing a broad assessment of the method's applicability.

Weaknesses

**W1:** The main concern about this paper is its limited novelty and insufficient distinction from prior work. The core component of TraSCE, the modified negative prompting, is adapted from Liu et al. (2022) on concept negation but lacks adequate justification for its novelty. While the addition of localized loss-based guidance is claimed as new, it fails to be differentiated from existing guidance techniques, such as classifier guidance. For instance, Schramowski et al. (2023) also employ traje

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. The method is very clear and easy to understand.
2. The proposed method performs excellently and shows outstanding results on multiple evaluation benchmarks.
3. The authors' experimental setup is comprehensive, taking into account various evaluation tasks, erasure robustness, and different base models.

Weaknesses

1. The application of the proposed method seems to be based on an unreasonable setting: that the specific category of harmful content must be predefined for the current generation. This is impractical in real-world scenarios. In contrast, recent related works [1, 2, 3] adopt a "detect-then-erase" mechanism, which first determines if a specific concept has been generated and only then performs concept erasure. This appears to be a more reasonable setup.
2. The additional generation time introduce

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Addresses safety in diffusion models, especially robustness to prompt-based jailbreaks.
2. Integrates a geometric “trajectory steering” view into the diffusion process, offering intuitive control over latent evolution.

Weaknesses

1. While “trajectory steering” offers a coherent new perspective, the implementation closely resembles classifier-free or loss-based guidance mechanisms already explored in prior works (e.g., SLD). The novelty primarily lies in problem framing and loss design rather than theoretical advancement.
2. The per-step gradient update increases sampling time by 2–3×, which may limit deployment for large-scale or real-time use.

Code & Models

Repositories

anubhav1997/trasce (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies

Methods: Diffusion