How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study

Simiao Ren; Yuchen Zhou; Xingyu Shen; Kidus Zewde; Tommy Duong; George Huang; Hatsanai (Neo) Tiangratanakul; Tsang (Dennis) Ng; En Wei; Jiayu Xue

arXiv:2602.07814 · cs.CV · February 10, 2026

TL;DR

This comprehensive benchmark study evaluates the out-of-the-box performance of 16 state-of-the-art AI-generated image detectors across diverse datasets, revealing significant variability, lack of universal effectiveness, and the importance of data alignment for real-world deployment.

Contribution

First zero-shot evaluation of multiple pretrained detectors across diverse datasets, highlighting their instability and the impact of training data alignment on performance.

Findings

01. No single detector outperforms the others across all datasets.

02. The best and worst detectors are separated by a performance gap of 37 percentage points.

03. Images from modern commercial generators often evade detection, with detectors reaching only 18-30% accuracy on them.
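
The kind of comparison behind these findings can be illustrated with a small sketch. It assumes a table of per-dataset accuracies for each detector and checks two things: whether any one detector is best on every dataset, and how far apart the best and worst detectors are in mean accuracy. The detector names, dataset names, and accuracy values below are hypothetical placeholders, not results from the paper, and the paper may define its 37-point gap differently than the mean-accuracy gap used here.

```python
from statistics import mean

# Hypothetical accuracies (fractions in [0, 1]); NOT taken from the paper.
ACCURACY = {
    "detector_a": {"dataset_1": 0.90, "dataset_2": 0.55, "dataset_3": 0.70},
    "detector_b": {"dataset_1": 0.60, "dataset_2": 0.85, "dataset_3": 0.65},
    "detector_c": {"dataset_1": 0.50, "dataset_2": 0.45, "dataset_3": 0.40},
}


def best_per_dataset(acc):
    """Map each dataset to the detector with the highest accuracy on it."""
    datasets = next(iter(acc.values())).keys()
    return {d: max(acc, key=lambda det: acc[det][d]) for d in datasets}


def universal_winner(acc):
    """Return the detector that is best on every dataset, or None."""
    winners = set(best_per_dataset(acc).values())
    return winners.pop() if len(winners) == 1 else None


def mean_accuracy_gap_pp(acc):
    """Gap in percentage points between the best and worst mean accuracy."""
    means = [mean(scores.values()) for scores in acc.values()]
    return (max(means) - min(means)) * 100


if __name__ == "__main__":
    print("Best per dataset:", best_per_dataset(ACCURACY))
    print("Universal winner:", universal_winner(ACCURACY))
    print(f"Mean-accuracy gap: {mean_accuracy_gap_pp(ACCURACY):.1f} pp")
```

In this toy table no detector wins everywhere, so `universal_winner` returns `None`. A real analysis would load measured accuracies in place of the placeholder dictionary.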

Abstract

As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in understanding out-of-the-box performance, the most common deployment scenario for practitioners. We present the first comprehensive zero-shot evaluation of 16 state-of-the-art detection methods, comprising 23 pretrained detector variants (due to multiple released versions of certain detectors), across 12 diverse datasets, comprising 2.6 million image samples spanning 291 unique generators including modern diffusion models. Our systematic analysis reveals striking findings: (1) no universal winner exists, with detector rankings exhibiting substantial instability…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis