Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues

Marco Willi; Melanie Mathys; Michael Graber

arXiv:2602.12381 · cs.CV · February 16, 2026

TL;DR

This paper evaluates CLIP-based models for synthetic image detection. It finds that they rely mainly on high-level visual cues rather than on generator artifacts, and it highlights how hard it is for them to generalize across different generative models.

Contribution

The study introduces SynthCLIC, a new paired dataset designed to reduce semantic bias, and analyzes what CLIP-based detectors learn, showing that they rely on semantic cues rather than on artifacts.
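
A paired design only removes semantic bias if the pairs stay together during training and evaluation. Assuming each real photograph is paired with a synthetic image of matching content (our reading of "paired"; the exact construction is not described here), the minimal sketch below assigns whole pairs to either the train or the test split. The `pair_id` and `label` fields are a hypothetical record layout, not SynthCLIC's actual schema.

```python
import random
from collections import defaultdict


def split_by_pair(records, test_fraction=0.2, seed=0):
    """Split records so that both images of a pair land in the same split.

    Assumes (hypothetically) that every record is a dict with a "pair_id"
    shared by a real image and its synthetic counterpart.
    """
    # Group records by the pair they belong to.
    pairs = defaultdict(list)
    for rec in records:
        pairs[rec["pair_id"]].append(rec)

    # Shuffle pair ids deterministically, then carve off the test pairs.
    pair_ids = sorted(pairs)
    random.Random(seed).shuffle(pair_ids)
    n_test = int(round(len(pair_ids) * test_fraction))
    test_ids = set(pair_ids[:n_test])

    train = [r for pid in pair_ids if pid not in test_ids for r in pairs[pid]]
    test = [r for pid in pair_ids if pid in test_ids for r in pairs[pid]]
    return train, test


# Usage with ten made-up pairs, each holding one real and one synthetic image.
records = [{"pair_id": i, "label": lbl} for i in range(10) for lbl in ("real", "synthetic")]
train, test = split_by_pair(records)
```

Because every pair contributes one real and one synthetic image to the same split, each split is class-balanced per scene, and no scene seen in training reappears at test time.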

Findings

01. CLIP detectors achieve high accuracy on GAN benchmarks.
02. Performance drops significantly on high-quality diffusion datasets.
03. Detectors rely more on semantic cues than on generator artifacts.
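
Findings 01 and 02 contrast accuracy across benchmark families. The minimal sketch below shows how such a per-benchmark breakdown can be computed from (benchmark, true label, predicted label) tuples. The benchmark names and labels in the usage lines are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_source(predictions):
    """Return accuracy per benchmark from (source, y_true, y_pred) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for source, y_true, y_pred in predictions:
        total[source] += 1
        correct[source] += int(y_true == y_pred)
    return {src: correct[src] / total[src] for src in total}


# Hypothetical predictions (1 = synthetic, 0 = real); not the paper's data.
preds = [
    ("gan", 1, 1), ("gan", 1, 1), ("gan", 0, 0), ("gan", 1, 0),
    ("diffusion", 1, 0), ("diffusion", 1, 1), ("diffusion", 0, 0), ("diffusion", 1, 0),
]
print(accuracy_by_source(preds))  # {'gan': 0.75, 'diffusion': 0.5}
```

Reporting accuracy per benchmark instead of pooling everything into one number is what exposes a generalization gap between generator families.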

Abstract

Recent generative models produce near-photorealistic images, challenging the trustworthiness of photographs. Synthetic image detection (SID) has thus become an important area of research. Prior work has highlighted how synthetic images differ from real photographs; unfortunately, SID methods often struggle to generalize to novel generative models and perform poorly in practical settings. CLIP, a foundational vision-language model that yields semantically rich image-text embeddings, shows strong accuracy and generalization for SID. Yet the underlying cues in CLIP features that drive detection remain unknown. It is unclear whether CLIP-based detectors simply detect strong visual artifacts or exploit subtle semantic biases, either of which would render them useless in practical settings or on high-quality generative models. We introduce SynthCLIC, a paired dataset of real…
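
CLIP-based detectors are commonly built by training a lightweight classifier on frozen CLIP image embeddings. The abstract does not say which head the authors use, so what follows is only a minimal sketch of one common choice: a logistic-regression linear probe, written in plain Python. The toy 3-dimensional vectors are assumptions that stand in for real CLIP embeddings, which would come from a CLIP image encoder and have hundreds of dimensions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length, as is customary for CLIP embeddings."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def train_linear_probe(features, labels, lr=0.5, epochs=500):
    """Fit logistic-regression weights on frozen embeddings (full-batch gradient descent)."""
    dim = len(features[0])
    w, b, n = [0.0] * dim, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of binary cross-entropy with respect to z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_synthetic(w, b, x):
    """Return the probe's probability that embedding x comes from a synthetic image."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Toy stand-in embeddings (hypothetical): label 0 = real, 1 = synthetic.
real = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
synthetic = [[0.1, 0.9, 0.2], [0.2, 0.8, 0.0]]
feats = [l2_normalize(v) for v in real + synthetic]
w, b = train_linear_probe(feats, [0, 0, 1, 1])
score = predict_synthetic(w, b, l2_normalize([0.15, 0.85, 0.1]))
```

Because the CLIP encoder stays frozen, whatever the probe can separate must already be present in the embedding. This is why asking which cues (semantic content or low-level artifacts) those embeddings carry matters for judging such detectors.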

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Face recognition and analysis