TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection
Ahmed Abdullah, Nikolas Ebert, Oliver Wasenmüller

TL;DR
This paper benchmarks various vision foundation models for AI-generated image detection, introduces a tunable attention pooling method to enhance detection accuracy, and establishes new state-of-the-art results.
Contribution
It systematically evaluates multiple VFMs for AIGI detection and proposes TAP, a simple classifier redesign, to significantly improve detection performance.
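The summary here does not spell out how TAP is built, so the following is only a generic illustration of attention pooling over patch tokens, not the authors' implementation. It assumes a single learnable query vector that scores each patch token, a softmax over those scores, and a linear classifier on the weighted sum. The names `attention_pool`, `classify`, `query`, `w`, and `b` are hypothetical.

```python
import math


def attention_pool(patch_tokens, query):
    """Pool N patch tokens (each a list of D floats) into one D-dim vector.

    Assumption: a single query vector of length D; this is a generic
    attention-pooling sketch, not the TAP design from the paper.
    """
    dim = len(query)
    # Scaled dot-product score of each patch token against the query.
    scores = [
        sum(t * q for t, q in zip(token, query)) / math.sqrt(dim)
        for token in patch_tokens
    ]
    # Numerically stable softmax over the patch dimension.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of patch tokens.
    pooled = [
        sum(wt * token[d] for wt, token in zip(weights, patch_tokens))
        for d in range(dim)
    ]
    return pooled, weights


def classify(pooled, w, b):
    """Hypothetical linear head: returns a sigmoid score in (0, 1)."""
    logit = sum(p * wi for p, wi in zip(pooled, w)) + b
    return 1.0 / (1.0 + math.exp(-logit))


# Two toy patch tokens with D = 2; a zero query gives uniform attention.
tokens = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = attention_pool(tokens, [0.0, 0.0])
score = classify(pooled, [0.0, 0.0], 0.0)
```

With a zero query the pooling reduces to plain mean pooling; a trained query would instead shift weight toward the patches most indicative of generation artifacts.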
Findings
The best-performing model outperforms CLIP by over 12% in accuracy.
TAP improves detection performance across multiple benchmarks.
The approach achieves a new state of the art in in-the-wild AI image detection.
Abstract
Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP's release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully-generated AI-images and AI-inpainted images, and discover that…
