DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners
Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

TL;DR
This paper introduces DASS, a distilled audio state space model that outperforms Transformers on AudioSet and demonstrates superior scalability to long audio durations, addressing previous limitations of SSMs in short and long audio tasks.
Contribution
The paper applies knowledge distillation to audio SSM training, reaching an mAP of 48.9 on AudioSet, and introduces a new test for retrieving sound events in long-duration audio.
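To make the distillation idea concrete, here is a minimal sketch of a distillation loss for multi-label audio tagging. It is not the paper's exact objective: the sigmoid outputs, the binary cross-entropy form, the mixing weight `lam`, and all function names are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy of one predicted probability against a
    (possibly soft) target in [0, 1]."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def distillation_loss(student_logits, teacher_logits, labels, lam=0.5):
    """Hypothetical loss: lam weights the ground-truth term and
    (1 - lam) weights the term that pulls the student toward the
    teacher's soft predictions. Averaged over classes."""
    hard = 0.0
    soft = 0.0
    for s, t, y in zip(student_logits, teacher_logits, labels):
        p_student = sigmoid(s)
        hard += bce(p_student, y)           # match the labels
        soft += bce(p_student, sigmoid(t))  # match the teacher
    return (lam * hard + (1.0 - lam) * soft) / len(labels)


# One clip, three sound classes (illustrative numbers only).
loss = distillation_loss(
    student_logits=[1.2, -0.7, 0.1],
    teacher_logits=[2.0, -1.5, 0.4],
    labels=[1.0, 0.0, 0.0],
)
print(round(loss, 4))
```

With `lam=1.0` the sketch reduces to ordinary supervised training, and with `lam=0.0` the student learns only from the teacher's outputs.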
Findings
DASS outperforms Transformers on AudioSet with an mAP of 48.9.
DASS effectively retrieves sound events in recordings up to 2.5 hours long.
Audio SSMs show superior duration scalability compared to Transformers.
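The mAP figure quoted above is a mean average precision, presumably reported as a percentage (48.9 corresponding to 0.489). The following is a minimal sketch of how such a score can be computed for multi-label tagging. The function names and the toy data are hypothetical, and the official AudioSet evaluation may differ in details such as tie handling.

```python
def average_precision(scores, labels):
    """Average precision for one class: mean of the precision values
    measured at the rank of each positive clip."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """Average the per-class AP over all classes that have at least one
    positive clip. Both matrices are indexed as [clip][class]."""
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        scores = [row[c] for row in score_matrix]
        labels = [row[c] for row in label_matrix]
        if any(labels):
            aps.append(average_precision(scores, labels))
    return sum(aps) / len(aps)


# Three clips, two classes (toy values, not results from the paper).
toy_scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.7]]
toy_labels = [[1, 0], [0, 0], [1, 1]]
print(mean_average_precision(toy_scores, toy_labels))  # 0.75
```

In the toy data, class 0 ranks both positives first (AP 1.0) while class 1 ranks its single positive second (AP 0.5), giving a mean of 0.75.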
Abstract
State-space models (SSMs) have emerged as an alternative to Transformers for audio modeling due to their high computational efficiency with long inputs. While recent efforts on Audio SSMs have reported encouraging results, two main limitations remain. First, in 10-second short audio tagging tasks, Audio SSMs still underperform Transformer-based models such as the Audio Spectrogram Transformer (AST). Second, although Audio SSMs theoretically support long audio inputs, their actual performance with long audio has not been thoroughly evaluated. To address these limitations, in this paper, 1) we applied knowledge distillation in audio state-space model training, resulting in a model called Knowledge Distilled Audio SSM (DASS). To the best of our knowledge, it is the first SSM that outperforms the Transformers on AudioSet and achieves an mAP of 48.9; and 2) we designed a new test called…
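The efficiency argument in the abstract can be illustrated with a toy linear state-space recurrence. This is only a sketch under strong assumptions: a scalar state with fixed parameters `a`, `b`, and `c`, not the selective mechanism of Mamba-style models nor the architecture used in DASS. The point is that the scan touches each frame once, while full self-attention compares every frame with every other frame.

```python
def ssm_scan(inputs, a=0.9, b=0.1, c=1.0):
    """Toy discrete SSM: state_t = a * state_{t-1} + b * x_t,
    y_t = c * state_t. One pass over the sequence, so the cost
    grows linearly with the number of frames."""
    state = 0.0
    outputs = []
    for x in inputs:
        state = a * state + b * x
        outputs.append(c * state)
    return outputs


def attention_pair_count(num_frames):
    """Number of query-key pairs scored by full self-attention,
    which grows quadratically with the number of frames."""
    return num_frames * num_frames


# Hypothetical frame counts, chosen only to show the growth rates.
for num_frames in (1_000, 10_000, 100_000):
    scan_steps = len(ssm_scan([0.0] * num_frames))
    print(num_frames, scan_steps, attention_pair_count(num_frames))
```

Increasing the input length tenfold increases the scan steps tenfold but the attention pair count a hundredfold, which is why long recordings favor the recurrent formulation.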
Code & Models
- 🤗 saurabhati/DASS_small_AudioSet_47.2 · model · 2 dl · ♡ 1
- 🤗 saurabhati/DASS_medium_AudioSet_47.6 · model · 2 dl
- 🤗 saurabhati/DASS_small_AudioSet_48.6 · model · 10 dl
- 🤗 saurabhati/DASS_medium_AudioSet_48.9 · model
- 🤗 saurabhati/DASS_small_AudioSet_50.1 · model · 45 dl
- 🤗 saurabhati/DASS_medium_AudioSet_50.2 · model · 53 dl · ♡ 2
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Linear Layer · Multi-Head Attention · Softmax · Residual Connection · Layer Normalization · Label Smoothing · Adam · Dropout
