SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models
Yuan-Kuei Wu, Yang Liu, Yiteng Huang, Zhaojun Yang, Haibin Wu, Ruizhe Huang, Yi-Te (Ethan) Hsu, Shuyu Kong, Ming Sun, Florian Metze, and Li Wan

TL;DR
This paper presents a novel test-time adaptation framework for generative spoken language models that improves robustness to acoustic variations during inference without requiring additional data or labels.
Contribution
It introduces the first test-time adaptation (TTA) method for generative SLMs, updating a minimal set of parameters on the fly to improve robustness while remaining efficient.
Findings
Consistent gains under diverse corruptions across automatic speech recognition, speech translation, and 19 audio understanding tasks from AIR-Bench.
No degradation in core task accuracy despite adaptation.
Supports deployment on resource-constrained devices.
Abstract
Spoken Language Models (SLMs) are increasingly central to modern speech-driven applications, but their performance degrades under acoustic shift: real-world noise, reverberation, and microphone variation. Prior solutions rely on offline domain adaptation, which is post-hoc, data-intensive, and slow. We introduce the first test-time adaptation (TTA) framework for generative SLMs that process interleaved audio-text prompts. Our method updates a small, targeted subset of parameters during inference using only the incoming utterance, requiring no source data or labels. This stabilizes token distributions and improves robustness to acoustic variability without degrading core task accuracy. Evaluated on automatic speech recognition, speech translation, and 19 audio understanding tasks from AIR-Bench, our approach yields consistent gains under diverse corruptions. Because adaptation touches only a…
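To make the idea of label-free, per-utterance adaptation concrete, here is a minimal sketch. It is not the paper's method: the abstract does not state the adaptation objective or which parameters are updated. The sketch assumes entropy minimization (a common test-time adaptation objective) and assumes that only an output bias vector is adaptable while the projection weights stay frozen. All names (`adapt_bias`, `logits_for`, the toy weights and features) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def logits_for(features, weights, bias):
    """Toy output layer: frozen weights (one row per token) plus a bias."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def adapt_bias(features, weights, bias, lr=0.5, steps=5):
    """Assumed objective: minimize prediction entropy on one unlabeled input.

    Only the bias is updated; the weights are never modified. For a softmax
    over logits z, the entropy gradient is dH/dz_k = -p_k * (log p_k + H).
    """
    bias = list(bias)  # copy so the caller's parameters are left intact
    for _ in range(steps):
        probs = softmax(logits_for(features, weights, bias))
        h = entropy(probs)
        grad = [-p * (math.log(max(p, 1e-12)) + h) for p in probs]
        bias = [b - lr * g for b, g in zip(bias, grad)]
    return bias


# Hypothetical toy inputs: 2 acoustic features, 3 candidate tokens.
frozen_weights = [[0.6, 0.2], [0.1, 0.3], [-0.2, 0.0]]
initial_bias = [0.0, 0.0, 0.0]
utterance_features = [1.0, 0.5]

adapted_bias = adapt_bias(utterance_features, frozen_weights, initial_bias)
before = entropy(softmax(logits_for(utterance_features, frozen_weights, initial_bias)))
after = entropy(softmax(logits_for(utterance_features, frozen_weights, adapted_bias)))
print(f"entropy before: {before:.4f}, after: {after:.4f}")
```

The update uses only the incoming input and no labels, and it changes a few parameters instead of the whole model, which is the property the abstract emphasizes. A real SLM would apply a step like this to selected parameters inside the network, and the choice of objective and parameter subset should be taken from the paper itself.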
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis
