Can VLMs Detect and Localize Fine-Grained AI-Edited Images?

Zhen Sun; Ziyi Zhang; Zeren Luo; Zhiyuan Zhong; Zeyang Sha; Tianshuo Cong; Zheng Li; Shiwen Cui; Weiqiang Wang; Jiaheng Wei; Xinlei He; Qi Li; Qian Wang

arXiv:2505.15644 · cs.CV · December 4, 2025

TL;DR

This paper introduces FragFake, a large-scale benchmark for detecting and localizing AI-edited image regions, and evaluates how well vision language models (VLMs) perform on it, highlighting their strengths and limitations.

Contribution

The paper builds FragFake, a large-scale benchmark dataset, and systematically studies how effective vision language models are at fine-grained edited-image detection and localization.

Findings

1. Pretrained VLMs perform poorly on the task.

2. Fine-tuned models like Qwen2.5-VL achieve high accuracy.

3. Data balancing and training domain significantly impact performance.

Abstract

Fine-grained detection and localization of localized image edits is crucial for assessing content authenticity, especially as modern diffusion models and image editors can produce highly realistic manipulations. However, this problem faces three key challenges: (1) most AIGC detectors produce only a global real-or-fake label without indicating where edits occur; (2) traditional computer vision methods for edit localization typically rely on costly pixel-level annotations; and (3) there is no large-scale, modern benchmark specifically targeting edited-image detection. To address these gaps, we develop an automated data-generation pipeline and construct FragFake, a large-scale benchmark of AI-edited images spanning multiple source datasets, diverse editing models, and several common edit types. Building on FragFake, we are the first to systematically study vision language models (VLMs)…
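The gap described in challenge (1), a global real-or-fake label versus a label that also says where the edit occurred, can be made concrete with a small scoring sketch. Everything here is an illustrative assumption rather than the paper's protocol: the `Prediction` record, naming the edited region by an object string, and the `score` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    """Hypothetical record: a global label plus an optional edited-object name."""
    edited: bool
    region: Optional[str] = None  # e.g. "dog"; None when the image is judged original


def score(preds, truths):
    """Return (detection accuracy, localization accuracy on truly edited images)."""
    if not preds or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and the same length")
    # Detection: does the global edited/original label match?
    detect_hits = sum(p.edited == t.edited for p, t in zip(preds, truths))
    # Localization is only defined for images that really were edited.
    edited_pairs = [(p, t) for p, t in zip(preds, truths) if t.edited]
    # A hit requires flagging the image as edited AND naming the right region.
    loc_hits = sum(p.edited and p.region == t.region for p, t in edited_pairs)
    detection_acc = detect_hits / len(truths)
    localization_acc = loc_hits / len(edited_pairs) if edited_pairs else float("nan")
    return detection_acc, localization_acc


# Toy data (made up for illustration, not drawn from FragFake).
truths = [Prediction(True, "dog"), Prediction(True, "car"), Prediction(False), Prediction(False)]
preds = [Prediction(True, "dog"), Prediction(True, "tree"), Prediction(False), Prediction(True, "cup")]
detection_acc, localization_acc = score(preds, truths)
print(detection_acc, localization_acc)
```

On this toy data the model gets the global label right on three of four images but names the correct region on only one of the two edited images, which is the kind of difference a global-only detector cannot reveal.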

Code & Models

Repositories

Vincent-HKUSTGZ/FragFake (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Diffusion