Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning

Xiaohong Liu; Xiufeng Song; Huayu Zheng; Lei Bai; Xiaoming Liu; Guangtao Zhai

arXiv:2511.18104 · cs.CV · November 25, 2025

Open Access

TL;DR

This paper introduces MM-Det++, a novel multimodal detection framework for identifying diffusion-generated videos, combining spatio-temporal analysis with multimodal reasoning and a unified learning module, supported by a new large-scale dataset.

Contribution

The paper presents a unified multimodal detection algorithm for diffusion-generated videos that integrates a Frame-Centric Vision Transformer (FC-ViT) with multimodal reasoning, along with a large-scale dataset for training and evaluation.

Findings

01. MM-Det++ outperforms existing methods in diffusion video detection.
02. The unified multimodal approach improves generalization and robustness.
03. Extensive experiments validate the effectiveness of the proposed framework.

Abstract

The proliferation of videos generated by diffusion models has raised increasing concerns about information security, highlighting the urgent need for reliable detection of synthetic media. Existing methods primarily focus on image-level forgery detection, leaving generic video-level forgery detection largely underexplored. To advance video forensics, we propose a consolidated multimodal detection algorithm, named MM-Det++, specifically designed for detecting diffusion-generated videos. Our approach consists of two innovative branches and a Unified Multimodal Learning (UML) module. Specifically, the Spatio-Temporal (ST) branch employs a novel Frame-Centric Vision Transformer (FC-ViT) to aggregate spatio-temporal information for detecting diffusion-generated videos, where the FC-tokens enable the capture of holistic forgery traces from each video frame. In parallel, the Multimodal (MM)…
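The abstract says FC-tokens capture holistic forgery traces from each frame, which the ST branch then aggregates across time. The abstract does not specify how this is computed, so the following is only a minimal illustrative sketch, not the authors' FC-ViT. It assumes one shared summary query per frame (the hypothetical `frame_token`), plain single-head dot-product attention, and mean pooling over time. All names, dimensions, and the random inputs are assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tokens):
    """Single-head dot-product attention of one query over token vectors."""
    dim = len(query)
    scores = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim) for tok in tokens
    ]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


def frame_summaries(video, frame_token):
    """Spatial step (assumed): one summary vector per frame.

    `video` is a list of frames; each frame is a list of patch embeddings.
    `frame_token` is a hypothetical stand-in for a per-frame summary query.
    """
    return [attend(frame_token, patches) for patches in video]


def video_embedding(video, frame_token):
    """Temporal step (assumed): mix frame summaries, then average over time."""
    summaries = frame_summaries(video, frame_token)
    mixed = [attend(s, summaries) for s in summaries]
    dim = len(frame_token)
    return [sum(m[i] for m in mixed) / len(mixed) for i in range(dim)]


# Toy input: random values stand in for real patch embeddings.
random.seed(0)
DIM, FRAMES, PATCHES = 8, 4, 16
video = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(PATCHES)]
    for _ in range(FRAMES)
]
frame_token = [random.gauss(0, 1) for _ in range(DIM)]

embedding = video_embedding(video, frame_token)
print(len(frame_summaries(video, frame_token)), len(embedding))
```

In this sketch the spatial step reduces each frame to one vector and the temporal step works only on those per-frame vectors, so a video-level classifier would see a single fixed-size embedding regardless of frame count.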

Taxonomy

Topics: Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques