Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models

Haoqin Sun; Chenyang Lyu; Shiwan Zhao; Xuanfan Ni; Xiangyu Kong; Longyue Wang; Weihua Luo; Yong Qin

arXiv:2602.05373·cs.SD·February 6, 2026

Open Access

TL;DR

Speech-XL introduces a novel approach to long-form speech understanding by leveraging key-value sparsification and a special summarization token, enabling efficient compression and processing of extended audio sequences in large speech language models.

Contribution

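The paper proposes Speech-XL, which uses a new Speech Summarization Token (SST) and a curriculum learning strategy to compress long speech inputs for large speech language models, targeting the two bottlenecks named in the abstract: limited context length and the memory footprint of long-form inference.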

Findings

01 Achieves competitive performance on the LongSpeech and AUDIOMARATHON benchmarks.

02 Utilizes significantly less training data than baseline models.

03 Effectively compresses long-form speech with high-ratio summarization.

Abstract

Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprints required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning, employing a curriculum learning strategy where the SST learns to compress information in a progressive manner--advancing from low-ratio…
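The interval-level compression described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the token string `<SST>`, the fixed interval length, the function names, and the rule that only the SST positions are retained in the KV cache are all assumptions made for illustration. The sketch places one summarization token after each speech interval and reports the compression ratio that results if only those positions are kept.

```python
from typing import List, Tuple

# Hypothetical token string; the listing does not give the actual token name.
SST = "<SST>"


def interleave_sst(frames: List[str], interval: int) -> Tuple[List[str], List[int]]:
    """Append one SST after every `interval` speech frames.

    Returns the interleaved sequence and the index of each SST in it.
    The last interval may be shorter than `interval`.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    sequence: List[str] = []
    sst_positions: List[int] = []
    for start in range(0, len(frames), interval):
        sequence.extend(frames[start:start + interval])
        sst_positions.append(len(sequence))  # index the SST is about to occupy
        sequence.append(SST)
    return sequence, sst_positions


def retained_kv_positions(sst_positions: List[int]) -> List[int]:
    """Assumption: once an interval is processed, only its SST's KV pair is kept."""
    return list(sst_positions)


def compression_ratio(num_frames: int, num_retained: int) -> float:
    """Speech frames per retained KV position."""
    if num_retained <= 0:
        raise ValueError("num_retained must be positive")
    return num_frames / num_retained


if __name__ == "__main__":
    frames = [f"s{i}" for i in range(8)]
    seq, sst_pos = interleave_sst(frames, interval=4)
    kept = retained_kv_positions(sst_pos)
    print(seq)
    print(kept, compression_ratio(len(frames), len(kept)))
```

Under these assumptions, a larger interval means fewer retained positions per frame and therefore a higher compression ratio, which is the quantity the abstract's curriculum is said to increase progressively during training.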

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing