Efficient Interleaved Speech Modeling through Knowledge Distillation
Mohammadmahdi Nouriborji, Morteza Rohanian

TL;DR
This paper introduces TinyWave, a family of compact speech generation models obtained through layer-aligned knowledge distillation. The models support speech-only and interleaved speech-text generation with minimal performance loss and are suited to deployment on limited hardware.
Contribution
The paper presents a novel layer-aligned distillation method to compress large multimodal transformers into 2B-parameter models for speech generation.
Findings
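TinyWave stays within 1.4 normalized perplexity points of its teacher on Libri-Light.
On spoken StoryCloze and SALMon, accuracy reaches 93-97% of the teacher's performance and exceeds that of size-matched baselines.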
Supports real-time speech and speech-text generation on commodity hardware.
Abstract
Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher's performance, outperforming size-matched baselines. These models are optimized for deployment on commodity…
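As a rough illustration of the three matching terms named in the abstract (hidden states, attention maps, and softened logits), the following is a minimal pure-Python sketch, not the authors' implementation. The uniform layer mapping in `align_layers`, the equal loss weights, the temperature of 2.0, and the toy data layout (one flat vector per layer, one attention row per layer, equal hidden sizes for student and teacher) are all assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def align_layers(num_student_layers, num_teacher_layers):
    """Hypothetical uniform mapping: student layer i is paired with the
    last teacher layer of the i-th block of `stride` teacher layers."""
    stride = num_teacher_layers // num_student_layers
    return [(i, (i + 1) * stride - 1) for i in range(num_student_layers)]


def distillation_loss(student, teacher, temperature=2.0,
                      w_hidden=1.0, w_attn=1.0, w_logit=1.0):
    """Weighted sum of hidden-state MSE, attention KL, and softened-logit KL.

    `student` and `teacher` are dicts with keys "hidden" (list of vectors),
    "attn" (list of attention distributions), and "logits" (one vector).
    """
    pairs = align_layers(len(student["hidden"]), len(teacher["hidden"]))

    hidden_term = sum(
        mse(student["hidden"][s], teacher["hidden"][t]) for s, t in pairs
    ) / len(pairs)

    attn_term = sum(
        kl_divergence(teacher["attn"][t], student["attn"][s]) for s, t in pairs
    ) / len(pairs)

    p_teacher = softmax(teacher["logits"], temperature)
    p_student = softmax(student["logits"], temperature)
    # The T^2 factor is the usual rescaling for softened-logit distillation.
    logit_term = temperature ** 2 * kl_divergence(p_teacher, p_student)

    return w_hidden * hidden_term + w_attn * attn_term + w_logit * logit_term


# Toy example: a 4-layer teacher and a 2-layer student.
teacher = {
    "hidden": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
    "attn": [[0.5, 0.5], [0.6, 0.4], [0.7, 0.3], [0.8, 0.2]],
    "logits": [2.0, 1.0, 0.1],
}
student = {
    "hidden": [[0.25, 0.45], [0.6, 0.9]],
    "attn": [[0.55, 0.45], [0.7, 0.3]],
    "logits": [1.5, 1.2, 0.3],
}
loss = distillation_loss(student, teacher)
```

In a real setup the hidden sizes of student and teacher usually differ, so a learned projection would typically sit between them before the hidden-state term is computed; that step is omitted here for brevity.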
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems
