Multi-modal Deepfake Detection and Localization with FPN-Transformer
Chende Zheng, Ruiqi Suo, Zhoulin Ji, Jingyi Deng, Fangbin Yi, Chenhao Lin, Chao Shen

TL;DR
This paper presents a multi-modal deepfake detection and localization framework built on an FPN-Transformer. It combines cross-modal features with temporal boundary regression to improve both detection accuracy and localization precision for audio-visual deepfakes.
Contribution
The paper introduces a multi-modal framework built on a Feature Pyramid-Transformer (FPN-Transformer) that improves cross-modal generalization and localizes manipulated segments in deepfake videos more precisely.
Findings
Achieves a detection and localization score of 0.7535 on the IJCAI'25 DDL-AV benchmark.
Uses pre-trained self-supervised models (WavLM for audio, CLIP for video) for hierarchical feature extraction.
Reports improved performance over existing unimodal methods in challenging environments.
Abstract
The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have shown progress in identifying synthetic media, their inability to leverage cross-modal correlations and precisely localize forged segments limits their practicality against sophisticated, fine-grained manipulations. To address this, we introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer), targeting critical gaps in cross-modal generalization and temporal boundary regression. The proposed approach utilizes pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract hierarchical temporal features. A multi-scale feature pyramid is constructed…
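The abstract names temporal boundary regression but is cut off before describing it, so the following is only a minimal sketch of one common anchor-free formulation, not the authors' confirmed method. It assumes that each time step at a given pyramid level predicts a forgery score plus its distances to the segment start and end, and that overlapping candidates are merged with non-maximum suppression. The function names (`decode_segments`, `temporal_iou`, `nms`), the threshold values, and the input layout are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical segment representation: (start_sec, end_sec, score).
Segment = Tuple[float, float, float]


def decode_segments(
    scores: List[float],
    offsets: List[Tuple[float, float]],
    stride_sec: float,
    threshold: float = 0.5,
) -> List[Segment]:
    """Turn per-step predictions at one pyramid level into candidate segments.

    scores[t]  : assumed probability that step t lies inside a forged segment
    offsets[t] : assumed (distance to start, distance to end), measured in steps
    stride_sec : duration of one step at this pyramid level, in seconds
    """
    segments: List[Segment] = []
    for t, (score, (d_start, d_end)) in enumerate(zip(scores, offsets)):
        if score < threshold:
            continue  # skip steps the classifier considers genuine
        start = max(0.0, (t - d_start) * stride_sec)
        end = (t + d_end) * stride_sec
        if end > start:
            segments.append((start, end, score))
    return segments


def temporal_iou(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    """Intersection over union of two intervals given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms(segments: List[Segment], iou_threshold: float = 0.5) -> List[Segment]:
    """Keep the highest-scoring segment among heavily overlapping candidates."""
    kept: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(temporal_iou(seg, k) <= iou_threshold for k in kept):
            kept.append(seg)
    return kept


if __name__ == "__main__":
    # Toy predictions for four time steps, each 0.5 s long.
    scores = [0.1, 0.9, 0.8, 0.2]
    offsets = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (0.0, 0.0)]
    candidates = decode_segments(scores, offsets, stride_sec=0.5)
    print(nms(candidates))
```

In a multi-scale setup, a decoder like this would typically run once per pyramid level with a different `stride_sec`, and the candidates from all levels would be pooled before suppression. Whether the paper does this is not stated in the truncated abstract.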
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Speech and Audio Processing
