TL;DR
This paper introduces In-Generation Detection (IGD), a novel method that uses the diffusion process's predicted noise to identify NSFW content during image generation, achieving over 91% accuracy.
Contribution
The paper presents a new in-generation NSFW detection approach that leverages diffusion model noise predictions, outperforming existing methods in accuracy.
Findings
IGD achieves 91.32% average detection accuracy.
IGD outperforms seven baseline methods.
Predicted noise captures semantic cues for NSFW detection.
Abstract
Diffusion-based text-to-image (T2I) models enable high-quality image generation but also pose significant risks of misuse, particularly in producing not-safe-for-work (NSFW) content. While prior detection methods have focused on filtering prompts before generation or moderating images afterward, the in-generation phase of diffusion models remains largely unexplored for NSFW detection. In this paper, we introduce In-Generation Detection (IGD), a simple yet effective approach that leverages the predicted noise during the diffusion process as an internal signal to identify NSFW content. This approach is motivated by preliminary findings suggesting that the predicted noise may capture semantic cues that differentiate NSFW from benign prompts, even when the prompts are adversarially crafted. Experiments conducted on seven NSFW categories show that IGD achieves an average detection accuracy…
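The core idea in the abstract, a classifier that reads the predicted noise during generation, can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `make_fake_noise` is a hypothetical stand-in that draws synthetic feature vectors instead of real noise predictions from a diffusion model, and the logistic-regression classifier, feature dimension, and mean shift are all chosen here for illustration only.

```python
import math
import random

def make_fake_noise(label, dim, rng):
    # Hypothetical stand-in for a flattened noise prediction at one denoising step.
    # Assumption for this toy: NSFW prompts (label 1) shift the feature mean.
    shift = 0.8 if label == 1 else -0.8
    return [rng.gauss(shift, 1.0) for _ in range(dim)]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    # Probability that feature vector x belongs to the NSFW class.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_logreg(xs, ys, lr=0.1, epochs=50):
    # Plain stochastic gradient descent on the logistic loss.
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = predict_proba(w, b, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

rng = random.Random(0)
dim = 16
labels = [i % 2 for i in range(400)]            # alternate benign (0) / NSFW (1)
feats = [make_fake_noise(y, dim, rng) for y in labels]

train_x, train_y = feats[:300], labels[:300]
test_x, test_y = feats[300:], labels[300:]

w, b = train_logreg(train_x, train_y)
preds = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in test_x]
accuracy = sum(p == y for p, y in zip(preds, test_y)) / len(test_y)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

In a real pipeline the feature vectors would come from the denoiser's output at one or more early timesteps, and the generation loop could stop as soon as the classifier flags the sample. The accuracy printed here reflects only the synthetic data and says nothing about the paper's reported results.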
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Clarity - The paper is well written and clearly structured. 2. Conceptual Simplicity - The method is conceptually simple yet effective. 3. Empirical Advantage - It substantially outperforms prior pre-generation NSFW detection methods. 4. Efficiency - Compared to post-generation detection approaches, IGD should in principle detect NSFW content faster, though the paper does not demonstrate this.
1. Limited Novelty: The research contribution is somewhat limited in scope. While the proposed approach offers potential speed advantages by detecting NSFW content during generation, classifiers on noisy intermediate representations are not new. a. Given that the approach involves training a classifier on predicted noise, it would be interesting to explore whether this classifier could also be used as a guidance signal to steer generation away from NSFW regions, potentially broadening the impact of the method.
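The guidance idea raised in point (a) can be written using the standard classifier-guidance update. This is only a sketch of what such steering might look like, not something the paper proposes: p_phi denotes a hypothetical benign/NSFW classifier, s a guidance scale, and c the prompt conditioning, all of which are assumptions introduced here. Because the IGD classifier reads the predicted noise rather than x_t directly, the gradient would have to be propagated back through the noise predictor.

```latex
\hat{\epsilon}_t
  = \epsilon_\theta(x_t, t, c)
  - s \,\sqrt{1 - \bar{\alpha}_t}\;
    \nabla_{x_t} \log p_\phi\!\left(y = \text{benign} \,\middle|\, \epsilon_\theta(x_t, t, c)\right)
```

Setting s = 0 recovers the unguided prediction. Larger values of s push sampling more strongly toward the region the classifier scores as benign, at the cost of an extra backward pass per step.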
- NSFW content generation in T2I models is a legitimate safety concern that warrants research attention. - The paper evaluates against multiple adversarial attack methods, which is important for assessing robustness.
- Section 4.2 is titled "In-Generation Detection Method" but provides almost no concrete methodological details; it mostly repeats the motivation from Section 3. The statement "we train a lightweight binary classifier" is insufficient. - Figure 2 shows several pairwise t-SNE comparisons but conspicuously omits the most important comparison: SFW vs. naive NSFW vs. adversarial NSFW all together. The paper claims adversarial prompts produce similar noise patterns to naive NSFW prompts, but Figure
- The authors bring forward the notion of in-generation NSFW detection in diffusion models (monitoring predicted noise at intermediate denoising steps), whereas prior work focuses almost exclusively on pre-generation prompt filtering and post-generation image moderation. - Experimental results (see Table 1, Table 2, Table 3) demonstrate strong robustness and high accuracy (92.45% mean on naive and adversarial prompts) across multiple challenging NSFW categories, substantially outperforming seven recent baseline systems. - The cla
- The motivation for using predicted noise is supported mainly via qualitative t-SNE analyses (Figure 2 and related Appendix figures), but these visualizations are limited to a handful of classes. There is minimal theoretical discussion on why predicted noise at early timesteps reliably encodes semantic intent for all prompt regimes, especially as the denoising process is stochastic and intermediate signals could, at times, be altered by prompt perturbations. - Table 11 explores layer count, but
- The topic addressed in this work is extremely important given the plethora of recent publicly available text-to-image models.
My main concern is the novelty of the proposed approach, which in my opinion is extremely limited. “While existing methods primarily focus on pre-detection (prompt filtering) and post-detection (image moderation), the possibility of detecting NSFW content during the image generation process itself has, to our knowledge, been largely overlooked.” - This is simply not true. There is a whole branch of work exploring similar ideas, mostly with steering vectors. See for example: - Gaints
1. The paper is well organized, with clear logic. 2. Its method effectively detects NSFW content early during the generation process.
1. Can we understand this as indirectly classifying the prompts? Since the prompt distributions themselves differ, classification on the noise indirectly classifies the prompts. Why not directly classify the text encoder output? How is this different from pre-detection? If we train a text classification model on I2P prompts, using the text encoder's output as input, I think it would still be useful. It means the ability and accuracy might come from the difference of prompt in
