Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

Zhengcen Li; Chenyang Jiang; Liangxu Su; Tong Shao; Shiyang Zhou; Ming Tao; Jingyong Su

arXiv:2605.21977 · cs.CV · May 22, 2026

TL;DR

VINA is a unified detection framework that enhances AI-generated image and video detection by training on both modalities and aligning their representations, significantly improving robustness and transferability.

Contribution

The paper introduces VINA, a novel approach that jointly trains on images and videos, addressing cross-modal gaps and achieving state-of-the-art detection performance.

Findings

1. VINA improves detection robustness across diverse benchmarks.

2. It achieves state-of-the-art results without complex tuning.

3. VINA enhances transferability between image and video detection tasks.

Abstract

AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations…
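
To make the synthesis-agnostic shifts named in the abstract more concrete, the following is a minimal sketch of how three of them (color conversion, resizing, and blur) might be simulated on a single still image. It is not VINA's method, which trains on real video frames rather than on simulated ones. Every function name, the BT.601 full-range color matrix, the 4:2:0-style chroma averaging, the nearest-neighbor resize, and the 3x3 box blur are assumptions chosen for illustration. Codec compression is deliberately left out, since it cannot be reproduced faithfully without a real encoder.

```python
import numpy as np


def rgb_to_ycbcr(rgb):
    """Convert float RGB (0-255) to YCbCr with BT.601 full-range coefficients (assumed)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)


def ycbcr_to_rgb(ycc):
    """Inverse of rgb_to_ycbcr."""
    y, cb, cr = ycc[..., 0], ycc[..., 1] - 128.0, ycc[..., 2] - 128.0
    r = y + 1.402 * cr
    g = y - 0.344136 * cb - 0.714136 * cr
    b = y + 1.772 * cb
    return np.stack([r, g, b], axis=-1)


def subsample_chroma(ycc):
    """Average each 2x2 chroma block and replicate it (height and width must be even)."""
    h, w, _ = ycc.shape
    out = ycc.copy()
    for c in (1, 2):  # Cb and Cr only; luma keeps full resolution
        blocks = ycc[..., c].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        out[..., c] = np.repeat(np.repeat(blocks, 2, axis=0), 2, axis=1)
    return out


def resize_roundtrip(img, factor=2):
    """Downsample by striding, then upsample by nearest-neighbor replication."""
    small = img[::factor, ::factor]
    big = np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
    return big[: img.shape[0], : img.shape[1]]


def box_blur(img, k=3):
    """Apply a k x k box blur with edge padding."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy : dy + img.shape[0], dx : dx + img.shape[1]]
    return out / (k * k)


def simulate_video_shift(rgb_uint8):
    """Chain the simulated shifts on an (H, W, 3) uint8 image with even H and W."""
    x = rgb_uint8.astype(np.float64)
    x = ycbcr_to_rgb(subsample_chroma(rgb_to_ycbcr(x)))
    x = resize_roundtrip(x, factor=2)
    x = box_blur(x, k=3)
    return np.clip(np.rint(x), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
    shifted = simulate_video_shift(image)
    print(shifted.shape, shifted.dtype)
```

A simulation like this only approximates the shifts in isolation. The abstract describes them as entangled in real video pipelines, which is part of the stated motivation for using actual video frames as augmentations.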
