Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection
Zhengcen Li, Chenyang Jiang, Liangxu Su, Tong Shao, Shiyang Zhou, Ming Tao, Jingyong Su

TL;DR
VINA is a unified framework for detecting AI-generated images and videos. It trains jointly on both modalities and aligns their representations, which improves robustness and transferability between image and video detection.
Contribution
The paper introduces VINA, which jointly trains on images and videos to address the cross-modal gap between image detectors and video frames, and reports state-of-the-art detection performance.
Findings
VINA improves detection robustness across diverse benchmarks.
It achieves state-of-the-art results without complex tuning.
VINA enhances transferability between image and video detection tasks.
Abstract
AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations…
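The synthesis-agnostic shifts named in the abstract (color conversion, resizing, blur) can be illustrated with a small sketch. This is not the authors' pipeline: the function names, the BT.601 full-range coefficients, the 2x resize factor, and the 3x3 box blur are all assumptions chosen for illustration, and codec compression is deliberately not modeled because it needs a real encoder.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Full-range BT.601 RGB -> YCbCr on float arrays in [0, 255] (assumed standard)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 + 0.564 * (b - y)
    cr = 128.0 + 0.713 * (r - y)
    return np.stack([y, cb, cr], axis=-1)

def ycbcr_to_rgb(ycc):
    """Approximate inverse of rgb_to_ycbcr."""
    y, cb, cr = ycc[..., 0], ycc[..., 1] - 128.0, ycc[..., 2] - 128.0
    r = y + 1.402 * cr
    g = y - 0.344136 * cb - 0.714136 * cr
    b = y + 1.772 * cb
    return np.stack([r, g, b], axis=-1)

def subsample_chroma(ycc):
    """4:2:0-style chroma subsampling: average each 2x2 chroma block, then repeat it."""
    h, w, _ = ycc.shape
    out = ycc.copy()
    for c in (1, 2):
        blocks = ycc[..., c].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        out[..., c] = np.repeat(np.repeat(blocks, 2, axis=0), 2, axis=1)
    return out

def down_up(img):
    """Resize shift: 2x block-average downsample, then nearest-neighbour upsample."""
    h, w, c = img.shape
    small = img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, 2, axis=0), 2, axis=1)

def box_blur(img, k=3):
    """k x k box blur with edge padding."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def simulate_video_shift(frame):
    """Apply resize, blur, and a lossy color round trip to a uint8 HxWx3 image.

    H and W must be even. Codec compression is NOT simulated here.
    """
    x = frame.astype(np.float64)
    x = box_blur(down_up(x))
    ycc = np.clip(np.round(subsample_chroma(rgb_to_ycbcr(x))), 0, 255)
    rgb = ycbcr_to_rgb(ycc)
    return np.clip(np.round(rgb), 0, 255).astype(np.uint8)
```

Passing a still image through `simulate_video_shift` gives a rough proxy for how a frame differs from its source image before any generator-specific fingerprint is considered. Real extracted video frames, which the abstract describes as the augmentation VINA actually uses, carry these shifts in entangled form along with codec artifacts.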
