# Spoof detection with dynamic learnable sparse attention and tri-modal fusion in resource-constrained audio systems

**Authors:** Xinwei Wang, Zhicheng Tan, Guo Li

PMC · DOI: 10.1371/journal.pone.0335240 · PLOS One · 2025-12-31

## TL;DR

This paper introduces a method for detecting spoofed audio in automatic speaker verification (ASV) systems. It replaces dense multi-head attention with a learnable sparse attention mechanism and fuses three input representations: MFCC, CQT, and the raw waveform.

## Contribution

The Dynamic Learnable Sparse Attention (DLSA) framework reduces computational cost by 80% relative to Multi-Head Attention (MHA)-based methods while improving spoof detection performance.
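
To illustrate why attending to a subset of frames is cheaper than dense attention, here is a minimal NumPy sketch of a generic top-k key-selection attention. It is not the paper's DLSA module, whose selection rule is not described in this summary. The function name `topk_key_attention`, the scoring vector `w`, and the parameter `keep` are all hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_key_attention(q, k, v, w, keep):
    """Attend from every query frame to only `keep` selected key frames.

    q, k, v : arrays of shape (T, d)
    w       : array of shape (d,), a hypothetical learnable scoring vector
    keep    : number of key frames retained (assumption, not from the paper)
    """
    d = k.shape[1]
    relevance = k @ w                       # one score per key frame, O(T * d)
    idx = np.argsort(relevance)[-keep:]     # indices of the `keep` highest-scoring frames
    scores = q @ k[idx].T / np.sqrt(d)      # (T, keep) instead of (T, T)
    return softmax(scores, axis=-1) @ v[idx], idx

rng = np.random.default_rng(0)
T, d = 10, 4
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
w = rng.normal(size=d)
out, idx = topk_key_attention(q, k, v, w, keep=2)
print(out.shape, T * T, T * len(idx))  # output shape, dense score count, sparse score count
```

With `keep` set to 20% of the sequence length, the score matrix has 80% fewer entries than the dense T × T matrix. Whether this is how the paper arrives at its 80% figure is not stated here. A trainable version would also need a differentiable relaxation of the hard top-k selection.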

## Key findings

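- The proposed method achieves an Equal Error Rate (EER) of 0.68% and a minimum tandem Detection Cost Function (t-DCF) of 0.0173 on the ASVspoof 2019 Logical Access (LA) dataset.
- This corresponds to a 33.6% reduction in EER relative to existing methods.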
- The framework reduces computational costs by 80% compared to traditional MHA-based methods.
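
For readers unfamiliar with the EER metric reported above, the following is a minimal sketch of how it can be computed from detection scores. It assumes higher scores mean "more likely bona fide" and uses a simple threshold sweep. Official ASVspoof evaluation scripts may interpolate differently, and the function name `equal_error_rate` is illustrative.

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold) at the point where false acceptance
    and false rejection rates are closest."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Candidate thresholds: every distinct observed score, ascending.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoof trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0, thresholds[i]

# Toy scores (made up for illustration, not from the paper).
eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
print(f"EER = {eer:.2%} at threshold {thr}")
```

An EER of 0.68% means that, at the operating point where the two error rates are equal, about 0.68% of spoofed trials are accepted and about 0.68% of bona fide trials are rejected.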

## Abstract

Audio sensors, essential for automatic speaker verification (ASV) systems, face growing threats from spoofed audio generated by advanced speech synthesis techniques. Traditional spoof detection methods, such as those based on computationally intensive Multi-Head Attention (MHA), suffer from quadratic complexity ($O(T^2)$) and high memory demands, making them impractical for deployment on resource-constrained audio sensors. To address these limitations, we propose a novel Dynamic Learnable Sparse Attention (DLSA) framework that integrates Mel-Frequency Cepstral Coefficients (MFCC), Constant-Q Transform (CQT), and raw waveform modalities for spoof detection. The DLSA module introduces a learnable attention mechanism that dynamically selects key spectral and temporal features from MFCC and CQT for cross-modal fusion. A ResNet backbone is used to extract features from the raw waveform. We also introduce a hybrid loss function combining cross-entropy loss ($\mathcal{L}_{\text{CE}}$) and center loss ($\mathcal{L}_{\text{center}}$), optimizing intra-class compactness and inter-class separability. Compared to MHA-based methods, our approach reduces computational costs by 80%. Experimental results on the ASVspoof 2019 Logical Access (LA) dataset demonstrate a significant performance boost, achieving an Equal Error Rate (EER) of 0.68% and a minimum tandem Detection Cost Function (t-DCF) of 0.0173, outperforming existing methods by 33.6% in EER reduction. This approach provides an efficient and robust solution for spoof detection in resource-constrained ASV systems.
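
The hybrid loss mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the abstract does not say how the two terms are weighted, so the weight `lam` and its default value are hypothetical, and the center loss follows the commonly used squared-distance-to-class-center form, averaged over the batch. In training, the class centers would be learned. Here they are passed in as a fixed array.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the true class (stable log-softmax).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def center_loss(embeddings, labels, centers):
    # Half the mean squared distance between each embedding and its class center;
    # pulling embeddings toward their center tightens each class.
    diffs = embeddings - centers[labels]
    return 0.5 * (diffs ** 2).sum(axis=1).mean()

def hybrid_loss(logits, embeddings, labels, centers, lam=0.01):
    # lam is an assumed weighting factor, not a value taken from the paper.
    return cross_entropy(logits, labels) + lam * center_loss(embeddings, labels, centers)

# Toy batch: two classes (e.g. bona fide = 0, spoof = 1), 2-D embeddings.
logits = np.array([[2.0, -1.0], [0.5, 1.5]])
embeddings = np.array([[1.0, 0.2], [-0.8, 0.1]])
labels = np.array([0, 1])
centers = np.array([[1.0, 0.0], [-1.0, 0.0]])
print(hybrid_loss(logits, embeddings, labels, centers))
```

The cross-entropy term separates the two classes at the decision boundary, while the center term penalizes spread within each class, which is the intra-class compactness the abstract refers to.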

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12755748/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12755748/full.md

## References

24 references — full list in the complete paper: https://tomesphere.com/paper/PMC12755748/full.md

---
Source: https://tomesphere.com/paper/PMC12755748