TL;DR
This paper introduces CMTA, a cross-modal framework that detects AI-generated videos by the unnaturally stable visual-textual semantic alignment they show over time.
Contribution
The work identifies cross-modal temporal artifacts as a cue for detecting AI-generated videos and captures them with joint visual-textual embeddings and multi-grained temporal modeling.
Findings
Outperforms existing methods on 40 dataset subsets
Achieves state-of-the-art detection accuracy
Demonstrates strong cross-generator generalization
Abstract
The proliferation of advanced AI video synthesis techniques poses an unprecedented challenge to digital video authenticity. Existing AI-generated video (AIGV) detection methods primarily focus on uni-modal or spatiotemporal artifacts, but they overlook the rich cues within the visual-textual cross-modal space, especially the temporal stability of semantic alignment. In this work, we identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA). Unlike real videos that exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, AIGVs display unnaturally stable semantic trajectories governed by given input prompts. To bridge this gap, we propose the CMTA framework, a cross-modal detection approach that captures these unique temporal artifacts through joint cross-modal embedding and multi-grained temporal modeling. Specifically,…
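The intuition in the abstract (alignment that fluctuates over time in real videos but stays unusually stable in generated ones) can be illustrated with a small sketch. This is not the authors' implementation: the function names `alignment_trajectory` and `multi_grained_fluctuation`, the window sizes, and the use of within-window standard deviation as the stability measure are all assumptions made for illustration, and random vectors stand in for the per-frame and text embeddings that a joint visual-textual encoder would produce.

```python
import numpy as np


def alignment_trajectory(frame_emb, text_emb):
    """Cosine similarity between each frame embedding and one text embedding.

    frame_emb: array of shape (num_frames, dim); text_emb: array of shape (dim,).
    Returns an array of shape (num_frames,).
    """
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return f @ t


def multi_grained_fluctuation(traj, windows=(2, 4, 8)):
    """Mean within-window standard deviation of the trajectory per window size.

    Hypothetical stand-in for multi-grained temporal modeling: small windows
    look at frame-to-frame jitter, large windows at slower semantic drift.
    """
    scores = {}
    for w in windows:
        n = len(traj) // w
        if n == 0:
            continue  # trajectory too short for this granularity
        chunks = traj[: n * w].reshape(n, w)
        scores[w] = float(chunks.std(axis=1).mean())
    return scores


# Synthetic stand-ins for encoder outputs (assumption: 16 frames, 64-dim).
rng = np.random.default_rng(0)
text = rng.normal(size=64)
stable = text + 0.01 * rng.normal(size=(16, 64))   # alignment barely moves
varying = text + 0.5 * rng.normal(size=(16, 64))   # alignment fluctuates

stable_scores = multi_grained_fluctuation(alignment_trajectory(stable, text))
varying_scores = multi_grained_fluctuation(alignment_trajectory(varying, text))
print("stable :", stable_scores)
print("varying:", varying_scores)
```

In this toy setup the "stable" trajectory yields much smaller fluctuation scores at every window size than the "varying" one. A real detector would presumably learn its decision from such temporal features rather than apply a fixed threshold.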
