Weakly Supervised Multimodal Temporal Forgery Localization via Multitask Learning
Wenbo Xu, Wei Lu, Xiangyang Luo

TL;DR
This paper introduces WMMT, a weakly supervised multimodal approach using multitask learning and a Mixture-of-Experts structure to accurately localize Deepfake forgeries in videos with only video-level labels.
Contribution
It proposes a novel multitask learning framework with a Mixture-of-Experts model and a deviation perceiving loss for fine-grained temporal forgery localization using weak supervision.
Findings
WMMT achieves localization accuracy comparable to fully supervised methods.
The proposed approach effectively integrates audio-visual modalities for Deepfake detection.
Experimental results validate the robustness and flexibility of the multitask learning paradigm.
Abstract
The spread of Deepfake videos has caused a trust crisis and impaired social stability. Although numerous approaches have been proposed for Deepfake detection and localization, systematic research on weakly supervised multimodal fine-grained temporal forgery localization (WS-MTFL) is still lacking. In this paper, we propose WMMT, a novel weakly supervised multimodal temporal forgery localization method based on multitask learning, which addresses WS-MTFL under the multitask learning paradigm. WMMT achieves multimodal fine-grained Deepfake detection and temporal partial forgery localization using only video-level annotations. Specifically, visual and audio modality detection are formulated as two binary classification tasks, and the multitask learning paradigm is introduced to integrate these tasks into a multimodal task. Furthermore, WMMT utilizes a…
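To make the weak-supervision setup concrete, here is a minimal Python sketch of how per-frame forgery scores could be trained from video-level labels only and then turned into temporal segments. It is an illustration under stated assumptions, not the WMMT implementation: top-k mean pooling, the max-based fusion of the two modality tasks, the fixed threshold, and all function names (`topk_mean`, `multitask_loss`, `localize`) are hypothetical choices made for this example. It also assumes each video carries one binary label per modality. The paper's Mixture-of-Experts structure and deviation perceiving loss are not modeled.

```python
import math


def topk_mean(scores, k):
    """Pool per-frame scores into one video-level score (top-k mean).

    Assumed pooling rule for illustration; the paper may aggregate differently.
    """
    top = sorted(scores, reverse=True)[:max(1, k)]
    return sum(top) / len(top)


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def multitask_loss(visual_scores, audio_scores, y_visual, y_audio, k=3):
    """Sum of three binary tasks: visual, audio, and fused multimodal.

    Only video-level labels are used; no frame-level timestamps are needed.
    """
    p_v = topk_mean(visual_scores, k)
    p_a = topk_mean(audio_scores, k)
    # Assumption: a video counts as forged if either modality is forged.
    p_m = max(p_v, p_a)
    y_m = max(y_visual, y_audio)
    return bce(p_v, y_visual) + bce(p_a, y_audio) + bce(p_m, y_m)


def localize(scores, threshold=0.5):
    """Return (start, end) frame ranges, end exclusive, with score >= threshold."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(scores)))
    return segments


if __name__ == "__main__":
    visual = [0.1, 0.2, 0.9, 0.8, 0.95, 0.1]  # made-up per-frame scores
    audio = [0.1] * 6
    print(multitask_loss(visual, audio, y_visual=1, y_audio=0))
    print(localize(visual))
```

The point of the sketch is that the loss only ever sees pooled, video-level predictions, while localization at inference time reads the per-frame scores directly. In a real system the scores would come from a learned network rather than fixed lists.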
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Emotion and Mood Recognition
