Towards Generalizable Deepfake Image Detection with Vision Transformers
Kaliki V Srinanda, M Manvith Prabhu, Hemanth K Mogilipalem, Jayavarapu S Abhinai, Vaibhav Santhosh, Aryan Herur, Deepu Vijayasenan

TL;DR
This paper presents an ensemble of fine-tuned vision transformers that improves generalization in deepfake image detection, achieving state-of-the-art results on the challenging DF-Wild dataset.
Contribution
The authors introduce a vision transformer ensemble method that outperforms CNN baselines and previous algorithms in deepfake detection, demonstrating enhanced generalization.
Findings
Ensemble of vision transformers achieves 96.77% AUC and 9% EER on the DF-Wild test set.
Outperforms the state-of-the-art deepfake detection algorithm Effort by 7.05% in AUC and 8% in EER.
Winning solution at IEEE SP Cup 2025.
Abstract
Detecting deepfake images is challenging because modern generative models evolve quickly and existing detection methods generalize poorly. In this paper, we use an ensemble of fine-tuned vision transformers, such as DINOv2, AIMv2, and OpenCLIP's ViT-L/14, to create a generalizable method for detecting deepfakes. We use the DF-Wild dataset, released as part of the IEEE SP Cup 2025, because it covers a challenging and diverse set of manipulation and generation techniques. We started our experiments with CNN classifiers trained on spatial features. Experimental results show that our ensemble outperforms individual models and strong CNN baselines, achieving an AUC of 96.77% and an Equal Error Rate (EER) of just 9% on the DF-Wild test set, beating the state-of-the-art deepfake detection algorithm Effort by 7.05% and 8% in AUC and EER, respectively. This…
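To make the reported metrics concrete, the sketch below shows one way an ensemble's scores could be fused and then evaluated with AUC and EER. It is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not say how the models are combined, so simple score averaging is assumed here, and the function names (`average_ensemble`, `roc_auc`, `equal_error_rate`) and the toy scores are hypothetical.

```python
from typing import List, Sequence


def average_ensemble(per_model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-image 'fake' scores across models (assumed fusion rule)."""
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random fake outscores a random real image.

    Ties between a fake and a real score count as half a win.
    """
    fakes = [s for y, s in zip(labels, scores) if y == 1]
    reals = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if f > r else 0.5 if f == r else 0.0
        for f in fakes
        for r in reals
    )
    return wins / (len(fakes) * len(reals))


def equal_error_rate(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Sweep observed scores as thresholds; report the error rate at the
    threshold where false positive and false negative rates are closest."""
    fakes = [s for y, s in zip(labels, scores) if y == 1]
    reals = [s for y, s in zip(labels, scores) if y == 0]
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(scores)):
        fnr = sum(f < threshold for f in fakes) / len(fakes)   # missed fakes
        fpr = sum(r >= threshold for r in reals) / len(reals)  # flagged reals
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


# Toy data (hypothetical): 1 = fake, 0 = real; one score list per model.
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.4, 0.8, 0.3, 0.6, 0.1]
model_b = [0.8, 0.7, 0.3, 0.2, 0.4, 0.3]
model_c = [0.7, 0.3, 0.9, 0.5, 0.2, 0.2]

ensemble = average_ensemble([model_a, model_b, model_c])

for name, scores in [("a", model_a), ("b", model_b), ("c", model_c), ("ensemble", ensemble)]:
    print(name, round(roc_auc(labels, scores), 3), round(equal_error_rate(labels, scores), 3))
```

In this toy setup each model misranks a different image, so averaging cancels the individual mistakes. That is only an illustration of why an ensemble can beat its members, not evidence about the reported DF-Wild numbers.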
