DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection
Hai Ci, Ziheng Peng, Pei Yang, Yingxin Xuan, Mike Zheng Shou

TL;DR
DiffSeg30k introduces a dataset of 30k diffusion-edited images with pixel-level annotations for localizing diffusion-based edits, shifting the focus from binary image classification to pixel-level segmentation for fine-grained AIGC detection.
Contribution
The paper presents DiffSeg30k, a novel dataset with pixel-level annotations for diffusion edit localization, and benchmarks segmentation approaches, highlighting their potential and challenges.
Findings
Segmentation models outperform traditional forgery classifiers in detecting diffusion edits.
Robustness to image distortions remains a significant challenge for segmentation methods.
Segmentation models show promise in cross-generator generalization for AIGC detection.
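To make the difference between image-level classification and pixel-level localization concrete, here is a minimal sketch of intersection-over-union (IoU) between a predicted edit mask and a ground-truth mask. IoU is a standard segmentation metric; this summary does not state which metrics the paper reports, so the function name `mask_iou`, the nested-list mask format, and the toy masks are all illustrative assumptions.

```python
from typing import Sequence

# Assumed format: rows of 0/1 values, where 1 marks a pixel flagged as edited.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection-over-union of two same-sized binary masks."""
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention chosen here: two empty masks (an unedited image predicted
    # as unedited) count as perfect agreement.
    return 1.0 if union == 0 else inter / union


# Toy 3x3 example (hypothetical data): the prediction misses one edited
# pixel and flags one unedited pixel.
truth = [[0, 1, 1],
         [0, 1, 1],
         [0, 0, 0]]
pred = [[0, 0, 1],
        [0, 1, 1],
        [0, 0, 1]]

print(mask_iou(pred, truth))  # 3 shared pixels / 5 pixels in the union
```

A binary classifier would count this prediction as fully correct, since both masks say "edited somewhere", while the IoU of 3/5 exposes the localization error.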
Abstract
Diffusion-based editing enables realistic modification of local image regions, making AI-generated content harder to detect. Existing AIGC detection benchmarks focus on classifying entire images, overlooking the localization of diffusion-based edits. We introduce DiffSeg30k, a publicly available dataset of 30k diffusion-edited images with pixel-level annotations, designed to support fine-grained detection. DiffSeg30k features: 1) In-the-wild images: we collect images or image prompts from COCO to reflect real-world content diversity; 2) Diverse diffusion models: local edits using eight SOTA diffusion models; 3) Multi-turn editing: each image undergoes up to three sequential edits to mimic real-world sequential editing; and 4) Realistic editing scenarios: a vision-language model (VLM)-based pipeline automatically identifies meaningful regions and generates context-aware prompts covering…
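As an illustration of what pixel-level annotations under multi-turn editing could look like, the sketch below merges per-turn edit masks into a single label map. The abstract does not say how overlapping edits are labeled, so the overwrite rule (a later edit replaces an earlier label), the integer model ids (0 for unedited pixels, 1 to 8 for the eight diffusion models), and the name `compose_label_map` are assumptions made for this example, not the authors' procedure.

```python
from typing import List, Tuple

Mask = List[List[int]]  # rows of 0/1 values; 1 marks a pixel changed in that turn


def compose_label_map(
    height: int, width: int, edits: List[Tuple[int, Mask]]
) -> List[List[int]]:
    """Merge sequential (model_id, mask) edits into one per-pixel label map.

    Assumption: label 0 means "unedited" and a later edit overwrites the
    label left by an earlier one where their masks overlap.
    """
    label_map = [[0] * width for _ in range(height)]
    for model_id, mask in edits:  # edits are given in editing order
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    label_map[y][x] = model_id
    return label_map


# Two hypothetical editing turns on a 2x3 image: model 3 edits the two
# left pixels of the top row, then model 5 edits the two right pixels.
turn_1 = (3, [[1, 1, 0],
              [0, 0, 0]])
turn_2 = (5, [[0, 1, 1],
              [0, 0, 0]])

labels = compose_label_map(2, 3, [turn_1, turn_2])
print(labels)  # [[3, 5, 5], [0, 0, 0]]
```

Thresholding such a map at "label > 0" gives a binary localization target, while the labels themselves would support per-pixel model attribution.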
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The paper constructs a dataset that moves beyond single-shot edits to support multi-turn editing workflows.
* DiffSeg30k provides an easily reproducible pipeline using only pre-trained models such as a VLM (e.g., Qwen2.5-VL), open-vocabulary segmentation (e.g., Grounded-SAM), and text-to-image diffusion models (e.g., SDXL) without requiring additional training.
**1. Limited novelty in dataset pipeline.** The current pipeline relies on vision-language models (e.g., Qwen2.5-VL) and open-vocabulary segmentation (e.g., Grounded-SAM) to generate pseudo labels before multi-turn editing. Similar pseudo-labeling strategies that couple region mining with text-to-image diffusion models are already common in existing studies [1, 2]. As a technical contribution, please consider a different labeling mechanism or provide sufficient analysis strongly related to AIGC ta
+ The paper proposes a new (to my knowledge) perspective that considers fine-grained localization and model attribution of diffusion-based image edits for an arbitrary image, instead of doing binary classification between real and AI-generated images.
+ The paper proposes a corresponding automatic pipeline for collecting annotated images and presents a 30k dataset suitable for the proposed tasks.
+ The evaluation looks abundant on FCN-8s and DeepLabv3 with additional results on binary ai content
- Although the paper proposes a new point of view moving from binary detection to fine-grained model attribution, can the authors comment on how this new task could concretely benefit the research community?
- The current baseline results focus primarily on the FCN-8s and DeepLabv3 models. We can observe significant improvements in detection rates with better architectures. Therefore, what if we use more advanced architectures and models such as ViT-based detectors or other segmentat
1. The paper defines a practically relevant and comprehensive task, covering AIGC detection, localization, and model attribution.
2. The dataset design is systematic, simulating multi-turn edits across eight diffusion models via an automated pipeline.
1. The dataset is not the first to address AIGC detection, localization, and attribution. The paper should more carefully discuss or experimentally compare against existing datasets, clarifying its unique contributions.
2. The automatic annotation pipeline may introduce noise and bias. Although low-quality samples are filtered using Qwen2.5-VL, the paper does not quantify annotation accuracy. The fact that roughly 50% of samples were discarded raises concerns about the stability of the generatio
- Comprehensive benchmark addressing a key unmet need in AIGC detection
- Automated, reproducible pipeline combining VLMs and diffusion models
- Multi-turn editing and diverse diffusion sources add realism
- Transparent evaluation of baseline performance and robustness
- The proposed benchmark focuses mainly on localized edits and does not effectively address global or stylistic transformations, which are also important for comprehensive AIGC detection.
- While object addition and removal edits are valid operations, they may be less representative of subtle AIGC manipulations compared to attribute-change edits that alter visual details without structural changes.
- Despite the automated quality filtering process, some residual artifacts or dataset biases may p
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Cell Image Analysis Techniques
