XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection
Kwok-Ho Ng, Tingting Song, Yongdong Wu, Zhihua Xia

TL;DR
This paper introduces XLSR-MamBo, a hybrid architecture that pairs an XLSR front-end with Mamba-Attention backbones for audio deepfake detection. It targets the global artifacts that pure causal state space models struggle to capture and reports competitive performance on multiple benchmarks.
Contribution
The work presents a scalable hybrid framework with novel topological designs, showing that increasing backbone depth improves detection stability and accuracy in audio deepfake detection.
Findings
The MamBo-3-Hydra-N3 configuration achieves competitive results on the ASVspoof 2021 LA and DF benchmarks and on In-the-Wild.
Hydra's bidirectional modeling captures temporal dependencies efficiently.
Deeper backbones reduce performance variance and improve robustness.
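The difference between causal and bidirectional sequence modeling can be illustrated with a toy scalar recurrence. This is only a sketch under stated assumptions: `causal_scan` and `bidirectional_scan` are hypothetical names, the fixed decay `a` stands in for the learned, input-dependent parameters of a real SSM, and summing a forward and a backward pass is a simplification, not Hydra's actual mixer.

```python
def causal_scan(x, a=0.9, b=1.0):
    """Toy scalar linear recurrence: h_t = a * h_{t-1} + b * x_t."""
    h, out = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        out.append(h)
    return out


def bidirectional_scan(x, a=0.9, b=1.0):
    """Combine a forward and a backward scan so each step sees both directions."""
    fwd = causal_scan(x, a, b)
    bwd = causal_scan(x[::-1], a, b)[::-1]
    # Subtract b * x_t once so the current input is not counted twice.
    return [f + r - b * x_t for f, r, x_t in zip(fwd, bwd, x)]


# Illustrative input: a single impulse at the final time step.
signal = [0.0, 0.0, 0.0, 1.0]
causal_out = causal_scan(signal, a=0.5)
bidir_out = bidirectional_scan(signal, a=0.5)
```

In the causal output, the earlier positions carry no trace of the final impulse, whereas in the bidirectional output every position receives a decayed contribution from it.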
Abstract
Advanced speech synthesis technologies have enabled highly realistic speech generation, posing security risks that motivate research into audio deepfake detection (ADD). While state space models (SSMs) offer linear complexity, pure causal SSM architectures often struggle with the content-based retrieval required to capture global frequency-domain artifacts. To address this, we explore the scaling properties of hybrid architectures by proposing XLSR-MamBo, a modular framework integrating an XLSR front-end with synergistic Mamba-Attention backbones. We systematically evaluate four topological designs using advanced SSM variants: Mamba, Mamba2, Hydra, and Gated DeltaNet. Experimental results demonstrate that the MamBo-3-Hydra-N3 configuration achieves competitive performance compared to other state-of-the-art systems on the ASVspoof 2021 LA, DF, and In-the-Wild benchmarks. This…
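As a rough illustration of what content-based retrieval means, the sketch below implements a single softmax attention read in plain Python. It is not the paper's attention layer: `attention_read` is a hypothetical helper, the values are scalars for brevity, and the keys, values, and `scale` are made-up toy numbers. The point is only that the read-out is selected by query-key similarity rather than by position in the sequence.

```python
import math


def attention_read(query, keys, values, scale=1.0):
    """Return a softmax-weighted sum of scalar values and the weights used."""
    # Similarity of the query to every key (scaled dot product).
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * v for w, v in zip(weights, values)), weights


# Toy memory of three frames; the query matches the second key.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [10.0, 20.0, 30.0]
out, weights = attention_read([0.0, 1.0], keys, values, scale=8.0)
```

Here almost all of the attention weight lands on the second frame because its key matches the query, regardless of where that frame sits in the sequence. A recurrence with a fixed-size state has no comparable direct lookup, which is one motivation for mixing attention layers into an SSM backbone.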
