BiCrossMamba-ST: Speech Deepfake Detection with Bidirectional Mamba Spectro-Temporal Cross-Attention

Yassine El Kheir; Tim Polzehl; Sebastian Möller

arXiv:2505.13930·cs.SD·May 21, 2025

TL;DR

BiCrossMamba-ST is a speech deepfake detection framework that combines bidirectional Mamba blocks with mutual cross-attention between a spectral and a temporal branch. It reports relative gains of 67.74% on ASVSpoof LA21 and 26.3% on ASVSpoof DF21 over AASIST.

Contribution

It introduces a dual-branch spectro-temporal architecture with mutual cross-attention and a convolution-based 2D attention map for robust deepfake detection.
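
The mutual cross-attention idea can be illustrated with a minimal NumPy sketch, in which each branch attends over the other branch's tokens. This is an assumption-laden toy, not the authors' implementation: it omits learned query/key/value projections, multiple heads, and the Mamba blocks, and the token counts, feature size, residual connection, and function names are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Scaled dot-product attention: each query row attends over the context rows."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)  # shape (n_queries, n_context)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ context                   # shape (n_queries, d)


def mutual_cross_attention(spectral, temporal):
    """Each branch queries the other; a residual keeps the branch's own features.

    The residual connection is an illustrative assumption.
    """
    spectral_out = spectral + cross_attention(spectral, temporal)
    temporal_out = temporal + cross_attention(temporal, spectral)
    return spectral_out, temporal_out


rng = np.random.default_rng(0)
spectral = rng.standard_normal((8, 16))   # 8 hypothetical sub-band tokens, dim 16
temporal = rng.standard_normal((20, 16))  # 20 hypothetical time-interval tokens, dim 16
s_out, t_out = mutual_cross_attention(spectral, temporal)
print(s_out.shape, t_out.shape)  # prints (8, 16) (20, 16)
```

Each branch keeps its own sequence length, so the two outputs can be pooled or fused afterwards without forcing the spectral and temporal axes to the same size.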

Findings

1. Achieves a 67.74% relative gain over AASIST on ASVSpoof LA21
2. Achieves a 26.3% relative gain over AASIST on ASVSpoof DF21
3. Achieves a 6.80% improvement over RawBMamba on ASVSpoof DF21
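
For readers unfamiliar with the term, a relative gain is usually computed as the reduction in an error metric divided by the baseline's value. The sketch below assumes a lower-is-better metric such as an error rate; this page does not name the metric, and the numbers in the example are made up for illustration, not taken from the paper.

```python
def relative_gain(baseline_error: float, model_error: float) -> float:
    """Percentage reduction of a lower-is-better error metric relative to a baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - model_error) / baseline_error


# Made-up values for illustration only; they are not results from the paper.
example = relative_gain(baseline_error=2.0, model_error=1.5)
print(f"{example:.2f}% relative gain")  # prints 25.00% relative gain
```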

Abstract

We propose BiCrossMamba-ST, a robust framework for speech deepfake detection that leverages a dual-branch spectro-temporal architecture powered by bidirectional Mamba blocks and mutual cross-attention. By processing spectral sub-bands and temporal intervals separately and then integrating their representations, BiCrossMamba-ST effectively captures the subtle cues of synthetic speech. In addition, our proposed framework leverages a convolution-based 2D attention map to focus on specific spectro-temporal regions, enabling robust deepfake detection. Operating directly on raw features, BiCrossMamba-ST achieves significant performance improvements: relative gains of 67.74% and 26.3% over the state-of-the-art AASIST on the ASVSpoof LA21 and ASVSpoof DF21 benchmarks, respectively, and a 6.80% improvement over RawBMamba on ASVSpoof DF21. Code and models will be made publicly available.
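
To give a rough intuition for why a bidirectional sequence model helps, the sketch below runs a simple linear recurrence over a sequence in both directions and sums the results, so every position sees context from its past and its future. It is a deliberately simplified stand-in: a real Mamba block uses input-dependent (selective) state-space parameters, and the fixed scalar `decay`, the summation of the two directions, and the function names here are all assumptions rather than details from the paper.

```python
def linear_scan(inputs, decay):
    """Run the recurrence h[t] = decay * h[t-1] + x[t] over a sequence of floats."""
    state, outputs = 0.0, []
    for x in inputs:
        state = decay * state + x
        outputs.append(state)
    return outputs


def bidirectional_scan(inputs, decay=0.5):
    """Sum a left-to-right scan and a right-to-left scan, position by position."""
    forward = linear_scan(inputs, decay)
    backward = linear_scan(inputs[::-1], decay)[::-1]
    return [f + b for f, b in zip(forward, backward)]


impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
print(linear_scan(impulse, 0.5))         # prints [0.0, 0.0, 1.0, 0.5, 0.25]
print(bidirectional_scan(impulse, 0.5))  # prints [0.25, 0.5, 2.0, 0.5, 0.25]
```

The forward-only scan leaves the positions before the impulse at zero, while the bidirectional version spreads its influence to both sides.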

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Handwritten Text Recognition Techniques