AxLSTMs: learning self-supervised audio representations with xLSTMs

Sarthak Yadav; Sergios Theodoridis; Zheng-Hua Tan

arXiv:2408.16568 · cs.SD · August 20, 2025

Open Access

TL;DR

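This paper introduces AxLSTM, a self-supervised audio representation learning method built on extended LSTMs (xLSTMs). Pretrained on AudioSet, it outperforms comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 25% in relative performance across ten downstream tasks, with up to 45% fewer parameters.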

Contribution

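It proposes AxLSTM, a self-supervised approach that learns audio representations from masked spectrogram patches using xLSTMs, and it is the first evaluation of xLSTMs for general-purpose audio representation learning, showing better downstream performance than transformer baselines with fewer parameters.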

Findings

1. AxLSTM outperforms comparable SSAST baselines by up to 25% in relative performance.

2. AxLSTM has up to 45% fewer parameters than transformer models.

3. Pretrained on AudioSet, AxLSTM generalizes well across a set of ten diverse downstream tasks.

Abstract

While the transformer has emerged as the preeminent neural architecture, several independent lines of research have sought to address its limitations. Recurrent neural approaches have seen renewed interest, including the extended long short-term memory (xLSTM) architecture, which reinvigorates the original LSTM. However, while xLSTMs have shown competitive performance compared to the transformer, their viability for learning self-supervised general-purpose audio representations has not been evaluated. This work proposes Audio xLSTM (AxLSTM), an approach for learning audio representations from masked spectrogram patches in a self-supervised setting. Pretrained on the AudioSet dataset, the proposed AxLSTM models outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 25% in relative performance across a set of ten diverse downstream tasks…
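
To make "masked spectrogram patches" concrete, the following is a minimal NumPy sketch of the general idea, not the authors' implementation. The abstract does not state the patch size, masking ratio, or training loss, so the 16×16 patches, 50% masking ratio, zero-filling of masked patches, and mean-squared-error objective are all illustrative assumptions, as are the function names `patchify`, `random_patch_mask`, and `masked_reconstruction_loss`. The xLSTM encoder itself is omitted; the corrupted input stands in for its output.

```python
import numpy as np


def patchify(spec: np.ndarray, patch_f: int, patch_t: int) -> np.ndarray:
    """Split a (freq, time) spectrogram into flattened, non-overlapping patches."""
    n_freq, n_time = spec.shape
    if n_freq % patch_f or n_time % patch_t:
        raise ValueError("spectrogram shape must be divisible by the patch shape")
    grid = spec.reshape(n_freq // patch_f, patch_f, n_time // patch_t, patch_t)
    # (freq_blocks, patch_f, time_blocks, patch_t) -> (num_patches, patch_f * patch_t)
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch_f * patch_t)


def random_patch_mask(num_patches: int, mask_ratio: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Boolean mask with True at the patches hidden from the model."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def masked_reconstruction_loss(pred: np.ndarray, target: np.ndarray,
                               mask: np.ndarray) -> float:
    """Mean squared error computed over the masked patches only (assumed loss)."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))


rng = np.random.default_rng(0)
# Placeholder input: random values standing in for an 80-bin x 96-frame log-mel spectrogram.
spec = rng.standard_normal((80, 96)).astype(np.float32)

patches = patchify(spec, patch_f=16, patch_t=16)   # 5 x 6 = 30 patches of 256 values
mask = random_patch_mask(len(patches), mask_ratio=0.5, rng=rng)

# Zero out the masked patches; a real model would predict them from the visible ones.
corrupted = np.where(mask[:, None], 0.0, patches)

# Stand-in for model predictions: the corrupted input itself.
loss = masked_reconstruction_loss(corrupted, patches, mask)
```

In an actual pretraining setup, the sequence of visible and masked patches would be fed to the encoder, and the loss would be backpropagated through its predictions for the masked positions.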

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing