Towards Audio Token Compression in Large Audio Language Models

Saurabhchand Bhati; Samuel Thomas; Hilde Kuehne; Rogerio Feris; James Glass

arXiv:2511.20973 · eess.AS · November 27, 2025

Open Access

TL;DR

This paper introduces audio token compression techniques for large audio language models to improve scalability and efficiency, enabling longer audio processing and deployment on resource-limited devices without significant performance loss.

Contribution

It proposes novel audio token compression methods combined with low-rank adapters for finetuning, enhancing LALM scalability while maintaining accuracy.

Findings

01 Achieves up to a threefold reduction in audio token count

02 Maintains performance close to that of frame-level models

03 Effective for speech recognition and translation tasks
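To make the threefold figure concrete, here is a rough back-of-the-envelope sketch of what it means for attention cost. It assumes full self-attention over audio tokens only, so the number of query-key pairs grows with the square of the token count. The token count of 1500 is a hypothetical value chosen for illustration, not a number taken from the paper.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


# Hypothetical audio token count before compression (illustrative only).
frame_level_tokens = 1500
# A threefold reduction in token count, as reported in the findings.
compressed_tokens = frame_level_tokens // 3

ratio = attention_pairs(frame_level_tokens) / attention_pairs(compressed_tokens)
print(f"{frame_level_tokens} -> {compressed_tokens} tokens, "
      f"{ratio:.0f}x fewer attention pairs")
```

Under these assumptions, cutting the token count by three cuts the pairwise attention work by nine. In a real LALM the sequence also contains text tokens, so the actual saving would be smaller.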

Abstract

Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. In this paper, we explore techniques such as unsupervised segmentation and uniform average pooling to reduce the number of audio tokens after they are generated by the LALM's audio encoder but before they are consumed by the LLM decoder. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to finetune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech…
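As an illustration of the uniform average pooling mentioned in the abstract, the following minimal sketch averages every `factor` consecutive encoder output vectors into one token. It is not the authors' implementation: the function name `average_pool_tokens`, the plain-list representation, and the choice to average a shorter trailing group rather than pad or drop it are all assumptions made for this example.

```python
from typing import List


def average_pool_tokens(tokens: List[List[float]], factor: int) -> List[List[float]]:
    """Average each run of `factor` consecutive token vectors into one vector.

    `tokens` is a list of equal-length embedding vectors (one per audio frame).
    A trailing group shorter than `factor` is averaged over the vectors it has
    (an assumption of this sketch; padding or dropping are other options).
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        dim = len(group[0])
        # Mean over the group, computed independently for each dimension.
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


# Seven toy 2-dimensional "encoder outputs": [0, 1], [2, 3], ..., [12, 13].
tokens = [[float(2 * i), float(2 * i + 1)] for i in range(7)]
pooled = average_pool_tokens(tokens, factor=3)
print(len(tokens), "->", len(pooled), "tokens")
```

With `factor=3`, seven input vectors become three: two full groups and one leftover vector. In a real model the pooled sequence would then be passed to the LLM decoder in place of the frame-level sequence.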

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing