Weakly Supervised Multimodal Temporal Forgery Localization via Multitask Learning

Wenbo Xu; Wei Lu; Xiangyang Luo

arXiv:2508.02179 · cs.CV · August 5, 2025

TL;DR

This paper introduces WMMT, a weakly supervised multimodal approach using multitask learning and a Mixture-of-Experts structure to accurately localize Deepfake forgeries in videos with only video-level labels.

Contribution

It proposes a novel multitask learning framework with a Mixture-of-Experts model and a deviation perceiving loss for fine-grained temporal forgery localization using weak supervision.

Findings

1. WMMT achieves localization accuracy comparable to fully supervised methods.

2. The proposed approach effectively integrates audio-visual modalities for Deepfake detection.

3. Experimental results validate the robustness and flexibility of the multitask learning paradigm.

Abstract

The spread of Deepfake videos has caused a trust crisis and impaired social stability. Although numerous approaches have been proposed for Deepfake detection and localization, systematic research on weakly supervised multimodal fine-grained temporal forgery localization (WS-MTFL) is still lacking. In this paper, we propose WMMT, a novel weakly supervised multimodal temporal forgery localization method based on multitask learning, which addresses WS-MTFL under the multitask learning paradigm. WMMT achieves multimodal fine-grained Deepfake detection and temporal partial forgery localization using only video-level annotations. Specifically, visual and audio modality detection are formulated as two binary classification tasks, and the multitask learning paradigm is introduced to integrate these tasks into a multimodal task. Furthermore, WMMT utilizes a…
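To make the weak-supervision idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes per-frame forgery scores in [0, 1] for each modality, pools them to a video-level probability with a top-k mean (a common multiple-instance-learning choice, swapped in here for illustration), and sums two per-modality binary cross-entropy terms with a combined multimodal term. The function names, the top-k pooling, the "fake if either modality is forged" rule, the availability of one video-level label per modality, and the 0.5 threshold are all assumptions; the Mixture-of-Experts structure and the deviation perceiving loss mentioned in the summary are not modeled.

```python
import math


def topk_mean(scores, k):
    """Pool frame scores to one video-level probability (assumed MIL pooling)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def multitask_loss(visual_scores, audio_scores, y_visual, y_audio, k=3):
    """Sum of visual, audio, and multimodal video-level losses (hypothetical)."""
    p_v = topk_mean(visual_scores, k)
    p_a = topk_mean(audio_scores, k)
    # Assumption: the video counts as fake if either modality is forged.
    p_m = 1 - (1 - p_v) * (1 - p_a)
    y_m = max(y_visual, y_audio)
    return bce(p_v, y_visual) + bce(p_a, y_audio) + bce(p_m, y_m)


def localize(scores, threshold=0.5):
    """Return [start, end) frame-index segments whose score exceeds threshold."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(scores)))
    return segments
```

In this sketch only video-level labels enter the loss, while `localize` reads temporal segments directly off the frame scores at inference time. For example, `localize([0.1, 0.9, 0.8, 0.2, 0.7])` returns `[(1, 3), (4, 5)]`.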

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Emotion and Mood Recognition