State-Space Large Audio Language Models

Saurabhchand Bhati; Yuan Gong; Leonid Karlinsky; Hilde Kuehne; Rogerio Feris; James Glass

arXiv:2411.15685 · eess.AS · November 26, 2024

TL;DR

This paper introduces the first state-space-based Large Audio Language Model (LALM). It replaces the transformers in both the audio perception module and the language module with state-space models, achieving competitive performance with fewer parameters while targeting the computational cost of transformer-based systems.

Contribution

The paper is the first to use state-space models in a large audio language model, replacing transformers in both the perception and language understanding modules.

Findings

01  The state-space LALM performs competitively on close-ended tasks.

02  It has significantly fewer parameters than transformer-based models.

03  It targets the computational scalability problem of Transformers, whose cost grows quadratically with input sequence length.

Abstract

Large Audio Language Models (LALMs) combine audio perception models with Large Language Models (LLMs) and show a remarkable ability to reason about the input audio, infer its meaning, and understand intent. However, these systems rely on Transformers, which scale quadratically with input sequence length, posing computational challenges when deploying them in memory- and time-constrained scenarios. Recently, state-space models (SSMs) have emerged as an alternative to transformer networks. While there have been successful attempts to replace transformer-based audio perception models with state-space ones, state-space-based LALMs remain unexplored. We first replace the transformer-based audio perception module and then replace the transformer-based LLM, proposing the first state-space-based LALM. Experimental results demonstrate that space-based…
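To make the scaling argument in the abstract concrete, here is a minimal sketch of a generic discrete linear state-space recurrence, x_k = A x_{k-1} + B u_k and y_k = C x_k. It is not the architecture used in the paper; the function names (`ssm_scan`, `attention_score_count`, `ssm_step_count`) and the toy parameters are assumptions chosen for illustration. The point is only that the recurrence makes one fixed-size state update per input frame, whereas full self-attention compares every pair of positions.

```python
def ssm_scan(A, B, C, inputs):
    """Run a single-input, single-output linear state-space recurrence.

    A is an n x n list of lists, B and C are length-n lists, and
    inputs is a sequence of scalars (for example, audio feature values).
    Illustrative only: practical SSM layers use structured parameters.
    """
    n = len(B)
    state = [0.0] * n  # the state size does not depend on sequence length
    outputs = []
    for u in inputs:
        # x_k = A x_{k-1} + B u_k
        state = [
            sum(A[i][j] * state[j] for j in range(n)) + B[i] * u
            for i in range(n)
        ]
        # y_k = C x_k
        outputs.append(sum(C[i] * state[i] for i in range(n)))
    return outputs


def attention_score_count(seq_len):
    """Pairwise scores computed by full self-attention: quadratic in length."""
    return seq_len * seq_len


def ssm_step_count(seq_len):
    """State updates made by the recurrence above: linear in length."""
    return seq_len


if __name__ == "__main__":
    # Hypothetical one-dimensional system with decay factor 0.5.
    print(ssm_scan([[0.5]], [1.0], [1.0], [1.0, 1.0, 1.0]))
    for length in (1_000, 10_000):
        print(length, ssm_step_count(length), attention_score_count(length))
```

Under these assumptions, a tenfold increase in sequence length means ten times as many recurrence steps but one hundred times as many attention scores, which is the gap that motivates state-space models for long audio inputs.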

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing