Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models
Bajian Xiang, Tingwei Guo, Xuan Chen, Yang Han

TL;DR
This paper investigates redundancy in large speech language models (LSLMs) and finds that deep-layer representations can be aggressively compressed without losing semantic content, which lowers inference cost.
Contribution
The paper introduces Affinity Pooling, a training-free, similarity-based token merging method that is applied at both the input and deep layers to reduce computational cost while preserving model accuracy.
Findings
Deep layers exhibit extreme redundancy, allowing aggressive compression
Affinity Pooling reduces FLOPs by 27.48% without accuracy loss
Deployment shows up to 1.7x memory savings and 1.1x faster inference
Abstract
Large Speech Language Models (LSLMs) typically operate at high token rates (tokens/s) to ensure acoustic fidelity, yet this results in sequence lengths that far exceed the underlying semantic content, incurring prohibitive inference costs. In this paper, we empirically revisit the necessity of such granular token-level processing. Through layer-wise oracle interventions, we unveil a structured redundancy hierarchy: while shallow layers encode essential acoustic details, deep layers exhibit extreme redundancy, allowing for aggressive compression. Motivated by these findings, we introduce Affinity Pooling, a training-free, similarity-based token merging mechanism. By strategically applying this method at both input and deep layers, we effectively compress speech representations without compromising semantic information. Extensive evaluations across three tasks demonstrate that our…
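To make the idea of training-free, similarity-based token merging concrete, here is a minimal sketch. It is not the paper's Affinity Pooling algorithm, whose details are not given in this summary. It assumes one plausible scheme: adjacent token vectors are grouped while their cosine similarity to the running group mean stays above a threshold, and each group is replaced by its mean. The function name `merge_similar_tokens` and the `threshold` parameter are hypothetical.

```python
import numpy as np


def merge_similar_tokens(tokens: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Illustrative sketch only (assumed scheme, not the paper's exact method).

    Greedily groups adjacent token vectors whose cosine similarity to the
    running group mean is >= threshold, then replaces each group by its mean.
    `tokens` has shape (sequence_length, hidden_dim).
    """
    if tokens.ndim != 2 or tokens.shape[0] == 0:
        raise ValueError("tokens must be a non-empty 2-D array")

    merged = []
    group_sum = tokens[0].astype(np.float64).copy()
    group_size = 1

    for tok in tokens[1:]:
        centroid = group_sum / group_size
        denom = np.linalg.norm(centroid) * np.linalg.norm(tok)
        # Treat zero vectors as dissimilar to avoid division by zero.
        sim = float(centroid @ tok / denom) if denom > 0 else 0.0

        if sim >= threshold:
            # Similar enough: absorb the token into the current group.
            group_sum += tok
            group_size += 1
        else:
            # Dissimilar: emit the current group's mean and start a new group.
            merged.append(group_sum / group_size)
            group_sum = tok.astype(np.float64).copy()
            group_size = 1

    merged.append(group_sum / group_size)
    return np.stack(merged)


# Four toy "speech tokens": two near-duplicates followed by two duplicates.
toy = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]])
print(merge_similar_tokens(toy, threshold=0.9).shape)  # (2, 2)
```

No parameters are learned here, which is what "training-free" means in this sketch: the only knob is the similarity threshold, and a higher value merges fewer tokens.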
