Deepfake Audio Detection Using Self-supervised Fusion Representations

Khalid Zaman; Qixuan Huang; Muhammad Uzair; Masashi Unoki

arXiv:2605.03420·cs.SD·May 6, 2026

TL;DR

This paper introduces a dual-branch deepfake detection framework that jointly models speech and environmental sounds using pretrained XLS-R and BEATs encoders, a Matching Head, and multi-head cross-attention, reaching an F1-score of 70.20% on the test set and an environmental EER of 16.54%.

Contribution

It proposes a novel fusion approach combining pretrained speech and environmental sound models with interaction modules for component-level deepfake detection.

Findings

01

Achieved an F1-score of 70.20% on the test set.

02

Attained an environmental EER of 16.54%, outperforming the baseline.

03

Effectively models independent manipulations of speech and environment.
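
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how an F1-score and an equal error rate (EER) can be computed from labels and detection scores. It is not the challenge's official scoring code. The function names, the binary (single positive class) F1, the "higher score means more bona fide" convention, and the toy numbers are all assumptions made for illustration; this summary does not state how the challenge averages F1 across classes.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for one positive class (hypothetical helper, not the official scorer)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def equal_error_rate(bona_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumption: a higher score means "more likely bona fide".
    FAR = fraction of spoofed items accepted (score >= threshold).
    FRR = fraction of bona fide items rejected (score < threshold).
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    best_gap, best_eer = None, None
    for t in sorted(set(bona_scores) | set(spoof_scores)):
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bona_scores) / len(bona_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy data, invented for illustration only.
f1 = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"F1 = {f1:.4f}, EER = {eer:.4f}")
```

A lower EER is better, while a higher F1 is better, so the two findings above point in the same direction even though the numbers move in opposite senses.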

Abstract

This paper describes a submission to the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2) 2026, which addresses component-level deepfake detection using the CompSpoofV2 dataset, where speech and environmental sounds may be independently manipulated. To address this challenge, a dual-branch deepfake detection framework is proposed to jointly model speech and environmental contextual representations from input audio. Two pretrained models, XLS-R for speech and BEATs for environmental sound, are used to extract complementary contextual representations. A Matching Head is introduced to model representation differences through statistical normalization and representation interaction, enabling estimation of the original class. In parallel, multi-head cross-attention enables effective information exchange between speech and environmental components. The refined…
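
The cross-attention idea mentioned in the abstract can be illustrated with a deliberately simplified sketch. The code below is not the authors' implementation: it uses a single head, omits the learned query, key, and value projections that a real multi-head layer (for example `torch.nn.MultiheadAttention`) would include, and replaces the XLS-R and BEATs outputs with tiny hand-written vectors. All names and values are hypothetical. It only shows how frames from one branch can be re-expressed as attention-weighted mixtures of frames from the other branch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention without learned projections.

    Each query frame attends over all frames of the other branch, which serve
    as both keys and values. Both inputs are lists of equal-length vectors.
    """
    d = len(queries[0])
    refined = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        refined.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)]
        )
    return refined


# Toy stand-ins for per-frame features (assumed shapes, not real encoder outputs).
speech_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # plays the role of XLS-R features
env_frames = [[0.5, 0.5], [2.0, 0.0]]                 # plays the role of BEATs features

speech_refined = cross_attention(speech_frames, env_frames)  # speech attends to environment
env_refined = cross_attention(env_frames, speech_frames)     # environment attends to speech
print(len(speech_refined), len(env_refined))
```

Running attention in both directions, as in the last two calls, is one plausible reading of "information exchange between speech and environmental components"; the summary here does not specify the exact wiring.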
