Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

Yikun Ji; Hong Yan; Jun Lan; Huijia Zhu; Weiqiang Wang; Qi Fan; Liqing Zhang; Jianfu Zhang

arXiv:2506.07045 · cs.CV · June 10, 2025

TL;DR

This paper develops a grounded reasoning approach using fine-tuned Multi-modal Large Language Models to detect AI-generated images, providing both high accuracy and human-understandable explanations.

Contribution

It introduces a dataset of AI-generated images annotated with bounding boxes and descriptive captions of synthesis artifacts, together with a multi-stage fine-tuning strategy that improves both the detection accuracy and the interpretability of MLLMs.

Findings

1. Outperforms baseline detection methods
2. Provides accurate localization of synthesis artifacts
3. Generates coherent textual explanations of visual flaws
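Localization of artifacts is commonly quantified with intersection-over-union (IoU) between a predicted box and an annotated box. The sketch below is an illustration under stated assumptions, not necessarily the metric or protocol used in the paper. It assumes axis-aligned boxes in (x_min, y_min, x_max, y_max) pixel coordinates and a hypothetical match threshold of 0.5. The helper names `iou` and `is_localized` are also hypothetical.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_localized(pred: Box, gt: Box, threshold: float = 0.5) -> bool:
    # Hypothetical criterion: a predicted artifact box counts as correct
    # when its IoU with the annotated box reaches the threshold.
    return iou(pred, gt) >= threshold


if __name__ == "__main__":
    annotated = (10, 10, 50, 50)
    predicted = (20, 20, 60, 60)
    print(round(iou(annotated, predicted), 3))  # prints 0.391
    print(is_localized(predicted, annotated))   # prints False
```

Here the boxes overlap in a 30 × 30 region, so IoU is 900 / 2300 ≈ 0.391. Under the assumed 0.5 threshold, the prediction would therefore not count as a correct localization.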

Abstract

The rapid advancement of image generation technologies intensifies the demand for interpretable and robust detection methods. Although existing approaches often attain high accuracy, they typically operate as black boxes without providing human-understandable justifications. Multi-modal Large Language Models (MLLMs), while not originally intended for forgery detection, exhibit strong analytical and reasoning capabilities. When properly fine-tuned, they can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still struggle with hallucination and often fail to align their visual interpretations with actual image content and human reasoning. To bridge this gap, we construct a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, establishing a foundation for human-aligned…
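The following sketch shows one possible shape for an annotation record in a dataset like the one described: an image label plus a list of artifact regions, each pairing a bounding box with a descriptive caption. It is purely illustrative, and the dataset's actual schema is not given here. The field names, the box convention, the example captions, and the `to_grounded_text` serialization (which renders a record as a textual explanation that cites box coordinates) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ArtifactRegion:
    # Hypothetical fields; box is (x_min, y_min, x_max, y_max) in pixels.
    box: Tuple[int, int, int, int]
    caption: str


@dataclass
class AnnotatedImage:
    image_path: str
    is_ai_generated: bool
    artifacts: List[ArtifactRegion] = field(default_factory=list)

    def to_grounded_text(self) -> str:
        """Render the record as an explanation that cites each artifact box."""
        if not self.is_ai_generated:
            return "Verdict: real. No synthesis artifacts annotated."
        lines = [f"Verdict: AI-generated. {len(self.artifacts)} artifact region(s):"]
        for i, region in enumerate(self.artifacts, start=1):
            x0, y0, x1, y1 = region.box
            lines.append(f"{i}. [{x0}, {y0}, {x1}, {y1}] {region.caption}")
        return "\n".join(lines)


if __name__ == "__main__":
    record = AnnotatedImage(
        image_path="example.png",  # placeholder path
        is_ai_generated=True,
        artifacts=[
            # Made-up example captions, for illustration only.
            ArtifactRegion((120, 80, 180, 140), "Fingers merge into each other."),
            ArtifactRegion((30, 200, 90, 260), "Text on the sign is illegible."),
        ],
    )
    print(record.to_grounded_text())
```

Pairing each caption with explicit coordinates keeps every textual claim tied to a specific image region. This is the kind of grounding that can help check an explanation against the actual image content.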

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)