TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection

Ahmed Abdullah; Nikolas Ebert; Oliver Wasenmüller

arXiv:2604.26772 · cs.CV · April 30, 2026

TL;DR

This paper benchmarks various vision foundation models for AI-generated image detection, introduces a tunable attention pooling method to enhance detection accuracy, and establishes new state-of-the-art results.

Contribution

The paper systematically evaluates multiple vision foundation models (VFMs) for AI-generated image (AIGI) detection and proposes TAP, a simple classifier redesign that significantly improves detection performance.

Findings

1. The best-performing model outperforms CLIP by over 12% in accuracy.
2. TAP improves detection performance across multiple benchmarks.
3. The approach sets a new state of the art for in-the-wild AI image detection.

Abstract

Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP's release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully-generated AI-images and AI-inpainted images, and discover that…
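
To make the idea of pooling patch tokens with attention more concrete, here is a minimal sketch in plain Python. It is a generic single-query attention pooling followed by a logistic head, not the authors' TAP design, whose details are not given in this summary. The names `attention_pool`, `fake_probability`, `query`, `head_weights`, and `head_bias` are hypothetical, and the sketch assumes the patch tokens come from a frozen VFM and that the query and head would be the trainable parts.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(patch_tokens, query):
    """Pool N patch tokens (each of dimension D) into one D-dim vector.

    Assumption: a single learnable query scores every patch token with a
    scaled dot product; the softmax of those scores weights the tokens.
    """
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)
    scores = [scale * sum(q * t for q, t in zip(query, token))
              for token in patch_tokens]
    weights = softmax(scores)
    pooled = [sum(w * token[d] for w, token in zip(weights, patch_tokens))
              for d in range(dim)]
    return pooled, weights


def fake_probability(pooled, head_weights, head_bias):
    """Hypothetical linear head: probability that the image is AI-generated."""
    logit = sum(w * x for w, x in zip(head_weights, pooled)) + head_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy usage: three 2-dimensional "patch tokens" standing in for VFM features.
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # an all-zero query gives every patch the same weight
pooled, weights = attention_pool(patch_tokens, query)
prob = fake_probability(pooled, head_weights=[0.5, -0.5], head_bias=0.0)
```

With an all-zero query the weights are uniform, so the pooling reduces to plain mean pooling over patches. A non-zero query lets the classifier emphasize some patches over others, which is the general motivation for attention pooling over a single global token.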
