SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models
Hsiao-Ying Huang, Cheng-Han Chiang, Hung-yi Lee

TL;DR
SPAR-K is an early exit framework for interleaved spoken language models (SLMs) that accelerates inference by exiting most speech positions at an intermediate layer, while largely preserving accuracy and perceptual quality across reasoning, factual QA, and dialogue tasks.
Contribution
Introduces SPAR-K, a modality-aware early exit method with a speech alternating-depth schedule, tailored for interleaved SLMs to reduce decoding complexity without sacrificing quality.
Findings
Reduces speech decoding depth by up to 11%; the abstract reports a maximum question-answering accuracy drop of 0.82%.
Maintains perceptual quality with negligible MOS and WER changes.
Confidence-based early exit strategies are suboptimal for SLMs.
Abstract
Interleaved spoken language models (SLMs) alternately generate text and speech tokens, but decoding at full transformer depth for every step becomes costly, especially due to long speech sequences. We propose SPAR-K, a modality-aware early exit framework designed to accelerate interleaved SLM inference while preserving perceptual quality. SPAR-K introduces a speech alternating-depth schedule: most speech positions exit at a fixed intermediate layer, while periodic full-depth "refresh" steps mitigate distribution shift due to early exit. We evaluate our framework using Step-Audio-2-mini and GLM-4-Voice across four datasets spanning reasoning, factual QA, and dialogue tasks, measuring performance in terms of ASR transcription accuracy and perceptual quality. Experimental results demonstrate that SPAR-K largely preserves question-answering accuracy with a maximum accuracy drop of 0.82%…
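The alternating-depth schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the parameters `full_depth`, `exit_layer`, and `period`, the choice to run text positions at full depth, and the choice to keep one running count of speech positions across segments are all assumptions made for illustration; the abstract only states that most speech positions exit at a fixed intermediate layer and that periodic full-depth refresh steps are inserted.

```python
from typing import List


def alternating_depth_schedule(
    modalities: List[str], full_depth: int, exit_layer: int, period: int
) -> List[int]:
    """Return how many transformer layers to run at each decoding position.

    Hypothetical helper: every `period`-th speech position is a full-depth
    "refresh" step; all other speech positions exit at `exit_layer`.
    Text positions are assumed to always use full depth.
    """
    if not 0 < exit_layer <= full_depth:
        raise ValueError("exit_layer must be in (0, full_depth]")
    if period < 1:
        raise ValueError("period must be at least 1")

    depths = []
    speech_seen = 0  # assumption: counted across all speech segments
    for modality in modalities:
        if modality == "speech":
            speech_seen += 1
            is_refresh = speech_seen % period == 0
            depths.append(full_depth if is_refresh else exit_layer)
        else:
            depths.append(full_depth)
    return depths


def speech_depth_reduction(
    depths: List[int], modalities: List[str], full_depth: int
) -> float:
    """Fraction of layer computations saved, averaged over speech positions."""
    speech = [d for d, m in zip(depths, modalities) if m == "speech"]
    if not speech:
        return 0.0
    return 1.0 - sum(speech) / (len(speech) * full_depth)


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper.
    tokens = ["text", "text"] + ["speech"] * 6
    schedule = alternating_depth_schedule(tokens, full_depth=28, exit_layer=24, period=3)
    print(schedule)
    print(speech_depth_reduction(schedule, tokens, full_depth=28))
```

In this sketch, a smaller `period` means more frequent refresh steps and therefore a smaller depth reduction, which is the trade-off the schedule is meant to control.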
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
