Leveraging Hierarchical Image-Text Misalignment for Universal Fake Image Detection

Daichi Zhang; Tong Zhang; Jianmin Bao; Shiming Ge; Sabine Süsstrunk

arXiv:2511.00427·cs.CV·November 4, 2025

TL;DR

This paper introduces ITEM, a novel fake image detection method that leverages hierarchical image-text misalignment in a joint visual-language space, improving generalization over existing visual-only approaches.

Contribution

The paper proposes a multi-modal detection approach using hierarchical image-text misalignment in CLIP space, enhancing robustness and generalization in fake image detection.

Findings

01. Outperforms state-of-the-art methods in generalization

02. Effective in detecting images from unseen generative models

03. Robust against various image manipulations

Abstract

With the rapid development of generative models, detecting generated fake images to prevent their malicious use has become a critical issue. Existing methods frame this challenge as a naive binary image classification task. However, such methods focus only on visual clues, so the trained detectors are prone to overfitting to specific image patterns and fail to generalize to unseen models. In this paper, we address this issue from a multi-modal perspective and find that fake images cannot be aligned with their corresponding captions as well as real images can. Based on this observation, we propose a simple yet effective detector, termed ITEM, that leverages image-text misalignment in a joint visual-language space as a discriminative clue. Specifically, we first measure the misalignment between images and captions in the pre-trained CLIP space, and then tune an MLP head to…
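The two steps named in the abstract (measure image-caption misalignment in an embedding space, then feed it to an MLP head) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the image and caption embeddings have already been produced by some CLIP-style encoder, uses `1 - cosine similarity` as a stand-in misalignment score, and uses an untrained, randomly initialised one-hidden-layer MLP. All function names, dimensions, and toy vectors are hypothetical, and the hierarchical part of the method is not shown because the visible text does not describe it.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def misalignment(image_emb, text_emb):
    """Assumed misalignment score: 1 - cosine similarity (0 means aligned)."""
    return 1.0 - cosine_similarity(image_emb, text_emb)


def init_mlp(in_dim, hidden_dim, seed=0):
    """Randomly initialise a one-hidden-layer MLP (hypothetical head, untrained)."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
               for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)],
        "b2": 0.0,
    }


def mlp_fake_probability(params, features):
    """Forward pass: ReLU hidden layer, then a sigmoid giving P(fake)."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(params["w1"], params["b1"])
    ]
    logit = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return 1.0 / (1.0 + math.exp(-logit))


# Toy stand-ins for embeddings; a real pipeline would obtain these from
# a pre-trained image encoder and text encoder sharing one space.
image_emb = [0.2, 0.9, 0.1]
caption_emb = [0.3, 0.8, 0.2]

score = misalignment(image_emb, caption_emb)
params = init_mlp(in_dim=1, hidden_dim=4)
prob = mlp_fake_probability(params, [score])
print(f"misalignment={score:.4f}, untrained P(fake)={prob:.4f}")
```

In this sketch only the MLP parameters would be tuned on labelled real and fake images, while the encoders that produce the embeddings stay fixed. That is one reading of "tune an MLP head" and is an assumption here.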

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Misinformation and Its Impacts · Adversarial Robustness in Machine Learning