Towards multi-modal forgery representation learning for AI-generated video detection and localization
Dat Le, Khoa Nguyen, Xin Wang, Shu Hu

TL;DR
This paper presents a multi-modal learning architecture that improves detection and temporal localization of AI-generated video forgeries by combining a semantic branch with visual and audio branches.
Contribution
The architecture combines an LMM semantic branch, a spatio-temporal (ST) visual branch, and a multi-scale partial-spoof (PS) audio branch to detect and temporally localize partially manipulated AI-generated videos.
Findings
Outperforms existing state-of-the-art methods in experiments.
Enables simultaneous detection and fine-grained temporal localization.
Effectively integrates multiple modalities for forgery analysis.
Abstract
Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including clips that are partially manipulated across visual and audio channels, pose escalating risks of semantic distortion and misuse, which motivates the need for reliable detection tools. Most existing AI-generated video detectors remain limited by single- or partial-modality modeling and by the lack of fine-grained temporal forgery localization. To address these challenges, we introduce a core architecture that jointly integrates an LMM semantic branch with a spatio-temporal (ST) visual branch and a multi-scale partial-spoof (PS) audio branch. This multi-modal approach enables simultaneous detection and fine-grained temporal localization of partially manipulated AI-generated video forgeries. Extensive experiments show that this approach outperforms existing…
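To make "simultaneous detection and fine-grained temporal localization" concrete, here is a minimal sketch of one way per-segment scores from three branches could be combined. It is not the authors' method: the abstract does not describe how the branches are fused, so the weighted-average late fusion, the 0.5 threshold, the one-second segment length, and the function names `fuse_scores`, `detect`, and `localize` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def fuse_scores(
    semantic: Sequence[float],
    visual: Sequence[float],
    audio: Sequence[float],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> List[float]:
    """Weighted average of per-segment fake scores (assumed late fusion)."""
    if not (len(semantic) == len(visual) == len(audio)):
        raise ValueError("all branches must score the same number of segments")
    ws, wv, wa = weights
    total = ws + wv + wa
    return [
        (ws * s + wv * v + wa * a) / total
        for s, v, a in zip(semantic, visual, audio)
    ]


def detect(scores: Sequence[float], threshold: float = 0.5) -> bool:
    """Video-level decision: flag the video if any segment looks forged."""
    return max(scores, default=0.0) >= threshold


def localize(
    scores: Sequence[float],
    threshold: float = 0.5,
    segment_seconds: float = 1.0,
) -> List[Tuple[float, float]]:
    """Merge consecutive above-threshold segments into (start, end) seconds."""
    intervals: List[Tuple[float, float]] = []
    start = None  # index of the first segment in the current forged run
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            intervals.append((start * segment_seconds, i * segment_seconds))
            start = None
    if start is not None:  # forged run extends to the end of the clip
        intervals.append((start * segment_seconds, len(scores) * segment_seconds))
    return intervals


# Hypothetical per-segment scores in [0, 1]; higher means "more likely forged".
semantic = [0.1, 0.2, 0.9, 0.8, 0.1]
visual = [0.0, 0.1, 0.8, 0.9, 0.2]
audio = [0.2, 0.3, 0.7, 0.7, 0.0]

fused = fuse_scores(semantic, visual, audio)
print(detect(fused), localize(fused))
```

In this toy input only the third and fourth segments score above the threshold in every branch, so the video is flagged and a single interval from 2.0 to 4.0 seconds is returned. A learned fusion module would replace the fixed average in practice; the sketch only shows how one set of fused scores can serve both the video-level decision and the temporal localization.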
