Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues
Marco Willi, Melanie Mathys, Michael Graber

TL;DR
This paper evaluates CLIP-based models for synthetic image detection, revealing that they rely mainly on high-level visual cues rather than generator artifacts, and highlighting the challenge of generalizing across different generative models.
Contribution
The study introduces SynthCLIC, a new dataset to reduce semantic bias, and analyzes what CLIP-based detectors learn, emphasizing their reliance on semantic cues over artifacts.
Findings
CLIP detectors achieve high accuracy on GAN benchmarks.
Performance drops significantly on high-quality diffusion datasets.
Detectors rely more on semantic cues than generator artifacts.
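The gap between the first two findings is the kind of result a cross-generator evaluation exposes: a detector is fit on fakes from one generator and then scored on fakes from another. The sketch below illustrates that protocol only. It is not the authors' code, it uses no CLIP model, and it does not reproduce any number from the paper. The "embeddings" are random toy vectors, and the two hypothetical generators are constructed to leave their cue on different axes, so the drop in the second score is built in by design. All function names are made up for this illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sample(rng, n, dim, shift_axis=None, shift=4.0):
    """Draw n toy 'embeddings'; fakes get a mean shift on one axis (assumption)."""
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if shift_axis is not None:
            x[shift_axis] += shift
        rows.append(x)
    return rows


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, features, labels):
    """Fraction of samples whose predicted label (score > 0 means fake) is right."""
    hits = 0
    for x, y in zip(features, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((score > 0) == (y == 1))
    return hits / len(labels)


def run_toy_experiment(seed=0, n=200, dim=8):
    """Train on real vs. generator A, test on held-out A and on unseen B."""
    rng = random.Random(seed)
    labels = [0] * n + [1] * n  # 0 = real, 1 = synthetic
    train_x = sample(rng, n, dim) + sample(rng, n, dim, shift_axis=0)
    w, b = train_linear_probe(train_x, labels)
    same_gen = sample(rng, n, dim) + sample(rng, n, dim, shift_axis=0)
    other_gen = sample(rng, n, dim) + sample(rng, n, dim, shift_axis=1)
    return accuracy(w, b, same_gen, labels), accuracy(w, b, other_gen, labels)


if __name__ == "__main__":
    in_dist, cross = run_toy_experiment()
    print(f"same-generator accuracy:  {in_dist:.3f}")
    print(f"cross-generator accuracy: {cross:.3f}")
```

In a real study the toy vectors would be replaced by embeddings from a frozen image encoder, and the held-out set would come from a generator family not seen during training. The probe and the two accuracy scores would stay the same.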
Abstract
Recent generative models produce near-photorealistic images, challenging the trustworthiness of photographs. Synthetic image detection (SID) has thus become an important area of research. Prior work has highlighted how synthetic images differ from real photographs. Unfortunately, SID methods often struggle to generalize to novel generative models and often perform poorly in practical settings. CLIP, a foundational vision-language model that yields semantically rich image-text embeddings, shows strong accuracy and generalization for SID. Yet the underlying relevant cues embedded in CLIP features remain unknown. It is unclear whether CLIP-based detectors simply detect strong visual artifacts or exploit subtle semantic biases, both of which would render them useless in practical settings or on generative models of high quality. We introduce SynthCLIC, a paired dataset of real…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Face recognition and analysis
