Multi-modal Deepfake Detection and Localization with FPN-Transformer

Chende Zheng; Ruiqi Suo; Zhoulin Ji; Jingyi Deng; Fangbin Yi; Chenhao Lin; Chao Shen

arXiv:2511.08031 · cs.CV · November 12, 2025

TL;DR

This paper presents a multi-modal deepfake detection and localization framework using FPN-Transformer, leveraging cross-modal features and temporal boundary regression to improve detection accuracy and localization precision in audio-visual deepfakes.

Contribution

It introduces a novel multi-modal framework with a Feature Pyramid-Transformer that enhances cross-modal generalization and precise localization of manipulated segments in deepfake videos.

Findings

01. Achieved a detection and localization score of 0.7535 on the IJCAI'25 DDL-AV benchmark.

02. Effectively leverages pre-trained self-supervised models for hierarchical feature extraction.

03. Demonstrated improved performance over existing unimodal methods in challenging environments.

Abstract

The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have shown progress in identifying synthetic media, their inability to leverage cross-modal correlations and precisely localize forged segments limits their practicality against sophisticated, fine-grained manipulations. To address this, we introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer), addressing critical gaps in cross-modal generalization and temporal boundary regression. The proposed approach utilizes pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract hierarchical temporal features. A multi-scale feature pyramid is constructed…
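
The multi-scale feature pyramid mentioned in the abstract can be illustrated with a small sketch. The abstract is truncated here and does not say how the pyramid is built, so everything below is an assumption made for illustration: the function names are hypothetical, the input stands in for per-frame features from an encoder such as WavLM or CLIP, and pairwise average pooling is a generic stand-in rather than the authors' downsampling operator.

```python
from typing import List

# A feature sequence: T frames, each a D-dimensional vector (illustrative type alias).
FeatureSequence = List[List[float]]


def downsample_by_two(features: FeatureSequence) -> FeatureSequence:
    """Halve the temporal length by averaging adjacent frame pairs.

    A trailing unpaired frame is kept unchanged.
    Assumption: average pooling is used only for illustration.
    """
    pooled: FeatureSequence = []
    for start in range(0, len(features), 2):
        window = features[start:start + 2]
        dim = len(window[0])
        pooled.append(
            [sum(frame[d] for frame in window) / len(window) for d in range(dim)]
        )
    return pooled


def build_temporal_pyramid(
    features: FeatureSequence, num_levels: int
) -> List[FeatureSequence]:
    """Return up to num_levels sequences, each at half the previous temporal resolution.

    Level 0 is the input itself. Construction stops early once a level
    has a single frame left.
    """
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    pyramid = [features]
    for _ in range(num_levels - 1):
        if len(pyramid[-1]) <= 1:
            break
        pyramid.append(downsample_by_two(pyramid[-1]))
    return pyramid


if __name__ == "__main__":
    # Eight toy frames with two feature dimensions each.
    toy_features = [[float(t), 1.0] for t in range(8)]
    for level, sequence in enumerate(build_temporal_pyramid(toy_features, 3)):
        print(level, len(sequence))
```

Coarser levels cover longer time spans per frame, which is the usual motivation for a pyramid when forged segments vary in duration. How the paper's model consumes these levels is not stated in the visible text.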

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Speech and Audio Processing