MSCT: Differential Cross-Modal Attention for Deepfake Detection

Fangda Wei; Miao Liu; Yingxue Wang; Jing Wang; Shenghui Zhao; Nan Li

arXiv:2604.07741·cs.CV·April 10, 2026

TL;DR

This paper introduces MSCT, a multi-scale cross-modal transformer encoder that enhances deepfake detection by improving feature extraction and modal alignment between audio and video modalities.

Contribution

The paper proposes a novel MSCT model with multi-scale self-attention and differential cross-modal attention for more effective deepfake detection.

Findings

1. Achieves competitive results on the FakeAVCeleb dataset.

2. Improves feature integration and modal alignment in audio-visual deepfake detection.

3. Validates the effectiveness of the proposed MSCT structure.

Abstract

Audio-visual deepfake detection typically employs a complementary multi-modal model to check for forgery traces in a video. These methods primarily detect forgery through audio-visual alignment, exploiting inconsistencies between the audio and video modalities. However, traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention module that integrates the features of adjacent embeddings and a differential cross-modal attention module that fuses multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.
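
The abstract names two mechanisms but does not give their formulation, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that "multi-scale" integration of adjacent embeddings can be approximated by averaging each embedding with its neighbours over several window sizes, and that "differential" cross-modal attention subtracts a second softmax attention map from the first, weighted by a scalar. The function names, the window sizes, the weight dictionary `w`, and the parameter `lam` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_scale_pool(x, scales=(1, 3, 5)):
    """Assumed stand-in for multi-scale integration of adjacent embeddings.

    x has shape (T, d). For each odd window size k, every embedding is
    averaged with its neighbours; the per-scale results are then averaged.
    """
    T, _ = x.shape
    outs = []
    for k in scales:
        pad = k // 2
        padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
        windows = np.stack([padded[i:i + T] for i in range(k)], axis=0)
        outs.append(windows.mean(axis=0))
    return np.mean(outs, axis=0)

def differential_cross_attention(query_feats, context_feats, w, lam=0.5):
    """Assumed form: (softmax(Q1 K1^T) - lam * softmax(Q2 K2^T)) V.

    Queries come from one modality, keys and values from the other.
    """
    q1, q2 = query_feats @ w["q1"], query_feats @ w["q2"]
    k1, k2 = context_feats @ w["k1"], context_feats @ w["k2"]
    v = context_feats @ w["v"]
    scale = np.sqrt(q1.shape[-1])
    a1 = softmax(q1 @ k1.T / scale)
    a2 = softmax(q2 @ k2.T / scale)
    return (a1 - lam * a2) @ v

# Toy usage with random features: 6 video frames, 10 audio frames, d = 8.
rng = np.random.default_rng(0)
d = 8
video = rng.normal(size=(6, d))
audio = rng.normal(size=(10, d))
w = {name: rng.normal(size=(d, d)) / np.sqrt(d)
     for name in ("q1", "q2", "k1", "k2", "v")}
fused = differential_cross_attention(
    multi_scale_pool(video), multi_scale_pool(audio), w
)
print(fused.shape)  # (6, 8)
```

In this sketch the video features act as queries over the audio features, so the output keeps the video sequence length. Subtracting the second attention map is one plausible way to suppress attention weight that both maps assign in common; whether MSCT does this is not stated in the abstract.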
