EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection
Hung Mai, Loi Dinh, Duc Hai Nguyen, Dat Do, Luong Doan, Khanh Nguyen Quoc, Huan Vu, Naeem Ul Islam, Tuan Do

TL;DR
EA-Swin is an embedding-agnostic transformer that detects AI-generated videos across diverse generators, outperforming existing methods and generalizing well on a new large-scale benchmark dataset.
Contribution
We introduce EA-Swin, a new spatiotemporal transformer model compatible with generic video embeddings, and create EA-Video, a comprehensive dataset for evaluating AI-generated video detection.
Findings
EA-Swin achieves 0.97-0.99 accuracy across major generators.
It outperforms prior state-of-the-art methods by 5-20%.
It maintains strong generalization to unseen data.
Abstract
Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Moreover, we construct the EA-Video dataset, a benchmark dataset comprising 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves…
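As a rough illustration of what "factorized windowed attention" over pretrained video embeddings could look like, here is a minimal NumPy sketch. It is not the authors' implementation: the tensor layout `(T, H, W, D)`, the non-overlapping window scheme, the identity query/key/value projections, and all function names (`self_attention`, `spatial_window_attention`, `temporal_attention`, `factorized_block`) are assumptions made for this example. The idea shown is that attention is first restricted to small spatial windows inside each frame, then applied along the time axis for each patch position, instead of attending over all `T*H*W` tokens at once.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # Single-head self-attention with identity Q/K/V projections
    # (an assumption to keep the sketch short; a real model learns them).
    # tokens: (..., N, D) -> (..., N, D)
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def spatial_window_attention(x, window):
    # x: (T, H, W, D) patch embeddings from any ViT-style encoder.
    # Attention runs inside non-overlapping window x window tiles of each frame.
    T, H, W, D = x.shape
    assert H % window == 0 and W % window == 0
    hb, wb = H // window, W // window
    # Group tokens by tile: (T, hb, wb, window*window, D)
    tiles = x.reshape(T, hb, window, wb, window, D)
    tiles = tiles.transpose(0, 1, 3, 2, 4, 5).reshape(T, hb, wb, window * window, D)
    out = self_attention(tiles)
    # Undo the tiling to recover the (T, H, W, D) layout.
    out = out.reshape(T, hb, wb, window, window, D)
    return out.transpose(0, 1, 3, 2, 4, 5).reshape(T, H, W, D)


def temporal_attention(x):
    # Attention across the T frames, separately for each spatial position.
    per_position = x.transpose(1, 2, 0, 3)       # (H, W, T, D)
    out = self_attention(per_position)
    return out.transpose(2, 0, 1, 3)             # back to (T, H, W, D)


def factorized_block(x, window=2):
    # Spatial step then temporal step, each with a residual connection.
    x = x + spatial_window_attention(x, window)
    x = x + temporal_attention(x)
    return x


# Hypothetical input: 4 frames, a 4x4 patch grid, 8-dim embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 4, 4, 8))
print(factorized_block(embeddings, window=2).shape)
```

Because the block only assumes a grid of patch embeddings, the same code would accept the output of any patch-based encoder, which is the sense in which such a design can be embedding-agnostic. Shifted windows, multi-head projections, normalization, and the classification head are all omitted here.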
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
