Leveraging Hierarchical Image-Text Misalignment for Universal Fake Image Detection
Daichi Zhang, Tong Zhang, Jianmin Bao, Shiming Ge, Sabine Süsstrunk

TL;DR
This paper introduces ITEM, a novel fake image detection method that leverages hierarchical image-text misalignment in a joint visual-language space, improving generalization over existing visual-only approaches.
Contribution
The paper proposes a multi-modal detection approach using hierarchical image-text misalignment in CLIP space, enhancing robustness and generalization in fake image detection.
Findings
Outperforms state-of-the-art methods in generalization
Effective in detecting images from unseen generative models
Robust against various image manipulations
Abstract
With the rapid development of generative models, detecting generated fake images to prevent their malicious use has recently become a critical issue. Existing methods frame this challenge as a naive binary image classification task. However, such methods focus only on visual clues, yielding trained detectors that are susceptible to overfitting to specific image patterns and incapable of generalizing to unseen models. In this paper, we address this issue from a multi-modal perspective and find that fake images cannot be properly aligned with their corresponding captions, whereas real images can. Building on this observation, we propose a simple yet effective detector, termed ITEM, that leverages the image-text misalignment in a joint visual-language space as a discriminative clue. Specifically, we first measure the misalignment of the images and captions in pre-trained CLIP's space, and then tune an MLP head to…
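The two steps named in the abstract (measure image-caption misalignment in an embedding space, then pass it to an MLP head) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are hand-written placeholder vectors standing in for CLIP image and text features, misalignment is assumed to be one minus cosine similarity, the head takes that single scalar as its input feature, and `TinyMLPHead` is a hypothetical, untrained class. The hierarchical part of the method is not covered, since the abstract is cut off before describing it.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def misalignment(image_emb, text_emb):
    """Assumed misalignment score: 0 when aligned, larger when less aligned."""
    return 1.0 - cosine_similarity(image_emb, text_emb)


class TinyMLPHead:
    """Hypothetical one-hidden-layer MLP mapping features to P(fake)."""

    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Randomly initialised weights; a real head would be trained.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def forward(self, features):
        # Hidden layer with ReLU activation.
        hidden = [
            max(0.0, sum(w * x for w, x in zip(row, features)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        # Output logit squashed to a probability with a sigmoid.
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))


# Placeholder vectors standing in for CLIP image/caption embeddings.
image_emb = [1.0, 0.0]
aligned_caption_emb = [1.0, 0.0]
unrelated_caption_emb = [0.0, 1.0]

score_aligned = misalignment(image_emb, aligned_caption_emb)
score_unrelated = misalignment(image_emb, unrelated_caption_emb)

head = TinyMLPHead(in_dim=1, hidden_dim=4)
prob_fake = head.forward([score_unrelated])
```

In this sketch only the small head would be trained while the encoder stays fixed, which matches the abstract's description of tuning an MLP head on top of pre-trained CLIP. Which features the actual head consumes is not stated in the visible part of the abstract.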
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Misinformation and Its Impacts · Adversarial Robustness in Machine Learning
