CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

Hang Wang, Chao Shen, Chenhao Lin, Minghui Yang, Lei Zhang, and Cong Wang

arXiv:2605.00630 · cs.CV · May 4, 2026

TL;DR

This paper introduces CMTA, a cross-modal framework that detects AI-generated videos by the unnaturally stable visual-textual semantic alignment they exhibit over time.

Contribution

The work pioneers the use of cross-modal temporal artifacts for detecting AI-generated videos, leveraging joint visual-textual embeddings and multi-grained temporal modeling.

Findings

01 Outperforms existing methods on 40 dataset subsets

02 Achieves state-of-the-art detection accuracy

03 Demonstrates strong cross-generator generalization

Abstract

The proliferation of advanced AI video synthesis techniques poses an unprecedented challenge to digital video authenticity. Existing AI-generated video (AIGV) detection methods primarily focus on uni-modal or spatiotemporal artifacts, but they overlook the rich cues within the visual-textual cross-modal space, especially the temporal stability of semantic alignment. In this work, we identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA). Unlike real videos that exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, AIGVs display unnaturally stable semantic trajectories governed by given input prompts. To bridge this gap, we propose the CMTA framework, a cross-modal detection approach that captures these unique temporal artifacts through joint cross-modal embedding and multi-grained temporal modeling. Specifically,…
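The core observation in the abstract, that cross-modal alignment fluctuates over time in real videos but stays unusually stable in AIGVs, can be illustrated with a toy sketch. This is not the authors' CMTA framework: it assumes per-frame visual embeddings and a text embedding are already available from some joint vision-language encoder (not specified here), and the function names `alignment_trajectory` and `temporal_fluctuation`, as well as the toy vectors, are hypothetical. It only computes a frame-by-frame cosine similarity to the text and summarizes how much that trajectory varies.

```python
import math
from statistics import pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_trajectory(frame_embs, text_emb):
    """Per-frame visual-textual alignment scores (hypothetical helper)."""
    return [cosine(frame, text_emb) for frame in frame_embs]


def temporal_fluctuation(trajectory):
    """Population standard deviation of the alignment scores over time."""
    return pstdev(trajectory)


# Toy, made-up embeddings; real ones would come from an encoder (assumption).
text = [1.0, 0.0]
stable_frames = [[1.0, 0.1], [1.0, 0.1], [1.0, 0.1]]   # alignment never changes
varying_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # alignment drifts

stable_score = temporal_fluctuation(alignment_trajectory(stable_frames, text))
varying_score = temporal_fluctuation(alignment_trajectory(varying_frames, text))
print(stable_score < varying_score)
```

A single standard deviation is only a crude stand-in for the multi-grained temporal modeling the abstract mentions; it is meant to make the notion of a "stable semantic trajectory" concrete, not to serve as a detector.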

Code & Models

Repositories

hwang-cs-ime/CMTA (GitHub)
