AxLSTMs: learning self-supervised audio representations with xLSTMs
Sarthak Yadav, Sergios Theodoridis, Zheng-Hua Tan

TL;DR
This paper introduces AxLSTM, a self-supervised audio representation learning method using extended LSTMs, which outperforms transformer-based models on multiple tasks with fewer parameters.
Contribution
The paper proposes AxLSTM, a self-supervised learning approach for audio built on xLSTMs, and reports better performance and parameter efficiency than transformer baselines.
Findings
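AxLSTM outperforms comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 25% in relative performance across ten diverse downstream tasks.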
AxLSTM has up to 45% fewer parameters than transformer models.
Pretrained on AudioSet, AxLSTM generalizes well across diverse tasks.
Abstract
While the transformer has emerged as the eminent neural architecture, several independent lines of research have emerged to address its limitations. Recurrent neural approaches have observed a lot of renewed interest, including the extended long short-term memory (xLSTM) architecture, which reinvigorates the original LSTM. However, while xLSTMs have shown competitive performance compared to the transformer, their viability for learning self-supervised general-purpose audio representations has not been evaluated. This work proposes Audio xLSTM (AxLSTM), an approach for learning audio representations from masked spectrogram patches in a self-supervised setting. Pretrained on the AudioSet dataset, the proposed AxLSTM models outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 25% in relative performance across a set of ten diverse downstream tasks…
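The abstract describes learning from masked spectrogram patches. As a rough illustration of that general idea only, the sketch below splits a spectrogram into non-overlapping patches and hides a random subset. It is not the authors' implementation: the function names, the 16x16 patch size, the 50% mask ratio, the 80x96 input shape, and the use of zeros in place of a learned mask token are all assumptions made for this example.

```python
import numpy as np


def patchify(spec, patch_f, patch_t):
    """Split a (freq, time) spectrogram into non-overlapping flattened patches."""
    n_f, n_t = spec.shape
    if n_f % patch_f or n_t % patch_t:
        raise ValueError("spectrogram shape must be divisible by the patch shape")
    grid_f, grid_t = n_f // patch_f, n_t // patch_t
    # (grid_f, patch_f, grid_t, patch_t) -> (grid_f, grid_t, patch_f, patch_t)
    patches = spec.reshape(grid_f, patch_f, grid_t, patch_t).transpose(0, 2, 1, 3)
    return patches.reshape(grid_f * grid_t, patch_f * patch_t)


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector that is True for the patches chosen to be hidden."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


rng = np.random.default_rng(0)

# Hypothetical log-mel spectrogram: 80 frequency bins x 96 time frames.
spec = rng.standard_normal((80, 96)).astype(np.float32)

patches = patchify(spec, patch_f=16, patch_t=16)       # 5 x 6 grid of patches
mask = random_patch_mask(len(patches), mask_ratio=0.5, rng=rng)

# Hidden patches are zeroed here as a stand-in for a mask token (an assumption).
masked_input = np.where(mask[:, None], 0.0, patches)

# A sequence model would see masked_input; the hidden patches are the targets.
targets = patches[mask]
```

In a setup like this, the encoder receives `masked_input` as a patch sequence and the training objective is defined only on the hidden positions selected by `mask`.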
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
