Divide and Conquer: Multimodal Video Deepfake Detection via Cross-Modal Fusion and Localization

Qingcao Li; Miao He; Liang Yi; Qing Wen; Yitao Zhang; Hongshuo Jin; Peng Cheng; Zhongjie Ba; Li Lu; Kui Ren

arXiv:2602.00209·cs.MM·February 3, 2026

TL;DR

This paper introduces a two-stage multimodal deepfake detection system that combines audio and visual analysis with score fusion to identify manipulated videos, achieving high accuracy on the DDL Challenge dataset.

Contribution

It proposes a novel multimodal score fusion strategy and integrates audio and visual localization modules for improved deepfake detection performance.

Findings

1. Achieved an AUC of 0.87 on the challenge test set
2. Developed a multimodal fusion strategy that enhances detection robustness
3. Integrated localization modules for pinpointing manipulated segments
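
The summary above does not say how the localization modules turn model outputs into manipulated segments. As a minimal sketch, assuming a module emits one fake probability per frame, contiguous runs above a threshold can be merged into time intervals. The function name `scores_to_segments`, the threshold of 0.5, the 0.04 s frame duration, and the `min_frames` filter are all illustrative assumptions, not details from the paper.

```python
from typing import List, Sequence, Tuple


def scores_to_segments(
    frame_scores: Sequence[float],
    threshold: float = 0.5,        # assumed decision threshold
    frame_duration: float = 0.04,  # assumed seconds per frame (25 fps)
    min_frames: int = 1,           # assumed minimum run length to keep
) -> List[Tuple[float, float]]:
    """Merge consecutive frames scored >= threshold into (start, end) intervals in seconds."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame in the current suspicious run
    for i, score in enumerate(frame_scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at frame i - 1; keep it only if it is long enough.
            if i - start >= min_frames:
                segments.append((start * frame_duration, i * frame_duration))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(frame_scores) - start >= min_frames:
        segments.append((start * frame_duration, len(frame_scores) * frame_duration))
    return segments
```

Raising `min_frames` discards isolated single-frame spikes, which is one simple way to trade recall for fewer spurious segments.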

Abstract

This paper presents a system for detecting fake audio-visual content (i.e., video deepfake), developed for Track 2 of the DDL Challenge. The proposed system employs a two-stage framework, comprising unimodal detection and multimodal score fusion. Specifically, it incorporates an audio deepfake detection module and an audio localization module to analyze and pinpoint manipulated segments in the audio stream. In parallel, an image-based deepfake detection and localization module is employed to process the visual modality. To effectively leverage complementary information across different modalities, we further propose a multimodal score fusion strategy that integrates the outputs from both audio and visual modules. Guided by a detailed analysis of the training and evaluation dataset, we explore and evaluate several score calculation and fusion strategies to improve system robustness.…
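
The abstract describes a second stage that fuses unimodal scores but, as excerpted here, does not give the fusion rule. The sketch below shows two common options under the assumption that each unimodal detector outputs a fake probability in [0, 1]: taking the maximum (a video counts as fake if either stream is manipulated) and a weighted average. The function `fuse_scores`, its `mode` names, and the `audio_weight` parameter are hypothetical and are not the authors' method.

```python
def fuse_scores(
    audio_score: float,
    visual_score: float,
    mode: str = "max",          # assumed fusion rule: "max" or "weighted"
    audio_weight: float = 0.5,  # assumed weight on the audio score for "weighted"
) -> float:
    """Combine per-modality fake probabilities into one video-level score."""
    for score in (audio_score, visual_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    if mode == "max":
        # Flag the video if either modality looks manipulated.
        return max(audio_score, visual_score)
    if mode == "weighted":
        if not 0.0 <= audio_weight <= 1.0:
            raise ValueError("audio_weight must lie in [0, 1]")
        # Convex combination of the two unimodal scores.
        return audio_weight * audio_score + (1.0 - audio_weight) * visual_score
    raise ValueError(f"unknown fusion mode: {mode!r}")
```

Max fusion suits datasets where only one modality may be forged, since a clean stream cannot dilute a strong signal from the other; the weighted form is easier to tune when one detector is known to be less reliable.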

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Image Enhancement Techniques