NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection & Localization
Yifei Li, Haoyuan He, Yu Zheng, Bingyao Yu, Wenzhao Zheng, Lei Chen, Jie Zhou, Jiwen Lu

TL;DR
This paper introduces NeXT-IMDL, a comprehensive benchmark for evaluating the generalization of image manipulation detection and localization models across diverse, real-world scenarios, revealing their systemic weaknesses.
Contribution
It presents a large-scale diagnostic benchmark with evaluation protocols that systematically test the robustness of IMDL models across multiple manipulation axes.
Findings
Current models perform well in original settings but fail under diverse evaluation protocols.
Models degrade significantly when tested on cross-dimension scenarios.
NeXT-IMDL reveals systemic weaknesses in existing IMDL methods, guiding future robustness improvements.
Abstract
The accessibility surge and abuse risks of user-friendly image editing models have created an urgent need for generalizable, up-to-date methods for Image Manipulation Detection and Localization (IMDL). Current IMDL research typically uses cross-dataset evaluation, where models trained on one benchmark are tested on others. However, this simplified evaluation approach conceals the fragility of existing methods when handling diverse AI-generated content, leading to misleading impressions of progress. This paper challenges this illusion by proposing NeXT-IMDL, a large-scale diagnostic benchmark designed not just to collect data, but to probe the generalization boundaries of current detectors systematically. Specifically, NeXT-IMDL categorizes AIGC-based manipulations along four fundamental axes: editing models, manipulation types, content semantics, and forgery granularity. Built upon…
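The abstract contrasts cross-dataset evaluation with probing generalization along four axes. As a rough illustration only, and not the benchmark's actual protocol code, a held-out split along one axis might look like the sketch below. The `Sample` fields, the `cross_axis_split` helper, and all example values are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

# The four axes named in the abstract; these field names are illustrative,
# not taken from the benchmark's own schema.
AXES = (
    "editing_model",
    "manipulation_type",
    "content_semantics",
    "forgery_granularity",
)


@dataclass(frozen=True)
class Sample:
    path: str
    editing_model: str
    manipulation_type: str
    content_semantics: str
    forgery_granularity: str


def cross_axis_split(samples, axis, held_out):
    """Split samples so the held-out values of `axis` appear only in the test set."""
    if axis not in AXES:
        raise ValueError(f"unknown axis: {axis!r}")
    held_out = set(held_out)
    train = [s for s in samples if getattr(s, axis) not in held_out]
    test = [s for s in samples if getattr(s, axis) in held_out]
    return train, test


# Placeholder records; the values are made up for the example.
samples = [
    Sample("a.png", "model_A", "removal", "person", "object"),
    Sample("b.png", "model_A", "insertion", "animal", "region"),
    Sample("c.png", "model_B", "removal", "person", "region"),
]

# Train on everything except model_B edits, then test only on model_B edits.
train, test = cross_axis_split(samples, "editing_model", {"model_B"})
```

The same helper can be pointed at any of the other three axes, so a detector is always tested on values of that axis it never saw during training.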
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
* The dataset is large-scale and covers a wide range of manipulation techniques across 4 axes of diversity, drawing on 3 different data sources.
* In Figure 1 the authors describe evaluating on 5 protocols, but Protocol 5 (Toward-Realworld-IMDL) does not appear as a separate table in the main paper. Since it is mentioned in both the abstract and Figure 1, its absence from the main paper seems very misleading. The point of Protocol 5 is to measure “lab-to-wild” performance, that is, how these models perform on commercial tools; I believe its omission is a major oversight.
1. The paper introduces NeXT-IMDL, a systematic benchmark that explicitly probes four key axes of IMDL generalization. 2. It provides a comprehensive dataset (558K samples) with 32 editing tools covering both academic and commercial generators. 3. The study delivers valuable empirical insights. 4. The experiments are extensive, involving representative IMDL models across five evaluation protocols. 5. The open-science commitment (planned dataset, code, and experiments release) enhances reproducib
1. The sections are dense, with complex cross-protocol discussions that may overwhelm general readers. Readability could be improved.
The benchmark is extensive, incorporating a wide variety of editing models (32) and covering multiple manipulation types and conditions. The five cross-domain protocols are well designed and effectively probe model weaknesses in generalization.
The work is primarily a dataset and evaluation framework. Its goal and methodology are very similar to prior benchmarks like IMDL-BenCo and GRE, making the conceptual advance somewhat limited. The paper does not propose a new detection model or a novel theoretical insight; it is mainly a "stress test" for existing methods. There are also some presentation issues, such as "Towars" in Figure 1 and the truncated label "Rem." in Line 111/Table 1, which detract from the overall polish.
1. The authors strategically identify four critical failure modes that hinder IMDL generalization—cross-edit models, cross-edit types, cross-semantic labels, and cross-edit granularity. This formulation provides clear diagnostic dimensions and is experimentally well-supported, offering strong intuition for guiding future improvements in IMDL robustness. 2. The scale of the IMDL evaluation is impressive, encompassing 11 state-of-the-art models tested across 32 editing techniques, including recent
Major: 1. There are already comprehensive benchmark evaluations for IMDL in [1, 2] and for deepfake detection in [3], following a similar methodology to the proposed work. However, the authors neither compare their results with these prior benchmarks nor clarify how NeXT-IMDL improves upon IMDL-BenCo [1]. Given the title NeXT-IMDL, it is reasonable to assume they are aware of the previous benchmarks, yet no direct comparison or analysis of improvements is presented. Is the inclusion of AIGC-base
Taxonomy
Topics: Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
