Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

Bajian Xiang; Tingwei Guo; Xuan Chen; Yang Han

arXiv:2604.06871 · cs.CL · April 9, 2026

TL;DR

This paper investigates redundancy in large speech language models and finds that deep-layer representations can be aggressively compressed without losing semantic content, which substantially reduces inference cost.

Contribution

The paper introduces Affinity Pooling, a training-free, similarity-based token merging method that reduces computational cost while preserving model accuracy.

Findings

01

Deep layers exhibit extreme redundancy, allowing aggressive compression

02

Affinity Pooling reduces FLOPs by 27.48% without accuracy loss

03

Deployment shows up to 1.7x memory savings and 1.1x faster inference

Abstract

Large Speech Language Models (LSLMs) typically operate at high token rates (tokens/s) to ensure acoustic fidelity, yet this results in sequence lengths that far exceed the underlying semantic content, incurring prohibitive inference costs. In this paper, we empirically revisit the necessity of such granular token-level processing. Through layer-wise oracle interventions, we unveil a structured redundancy hierarchy: while shallow layers encode essential acoustic details, deep layers exhibit extreme redundancy, allowing for aggressive compression. Motivated by these findings, we introduce Affinity Pooling, a training-free, similarity-based token merging mechanism. By strategically applying this method at both input and deep layers, we effectively compress speech representations without compromising semantic information. Extensive evaluations across three tasks demonstrate that our…
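
To make the idea of similarity-based token merging concrete, here is a minimal sketch in Python. It is not the authors' Affinity Pooling algorithm, whose exact merging rule is not given in this excerpt. It assumes a simple greedy rule: adjacent token vectors are averaged together while their cosine similarity to the running group mean stays at or above a threshold. The names `cosine` and `merge_similar_tokens`, the threshold of 0.9, and the toy frames are all hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_similar_tokens(tokens: List[Vector], threshold: float = 0.9) -> List[Vector]:
    """Greedily mean-pool runs of adjacent, similar tokens.

    Assumption (not from the paper): a token joins the current group when
    its cosine similarity to the group's mean is >= threshold.
    """
    merged: List[Vector] = []
    group_sum: Vector = []
    group_size = 0
    for tok in tokens:
        if group_size > 0:
            mean = [s / group_size for s in group_sum]
            if cosine(mean, tok) >= threshold:
                # Similar enough: fold the token into the current group.
                group_sum = [s + t for s, t in zip(group_sum, tok)]
                group_size += 1
                continue
            # Not similar: close the current group and emit its mean.
            merged.append(mean)
        # Start a new group with this token.
        group_sum = list(tok)
        group_size = 1
    if group_size > 0:
        merged.append([s / group_size for s in group_sum])
    return merged


# Toy 2-D "speech frames": two near-duplicates, then three near-duplicates.
frames = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0], [0.02, 0.99], [0.01, 1.0]]
pooled = merge_similar_tokens(frames, threshold=0.9)
print(len(frames), "->", len(pooled))  # prints "5 -> 2"
```

In this toy run five frames collapse to two pooled vectors. Because the rule needs no learned parameters, it is training-free in the same sense the abstract describes, although a real system would apply it to hidden states inside the model rather than to hand-written vectors.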
