Efficient Interleaved Speech Modeling through Knowledge Distillation

Mohammadmahdi Nouriborji; Morteza Rohanian

arXiv:2506.23670·cs.SD·October 23, 2025

Open Access 3 Models

TL;DR

This paper introduces TinyWave, a family of compact speech generation models built with layer-aligned knowledge distillation. The models generate speech and interleaved speech-text with minimal performance loss relative to the teacher and are suited to deployment on limited hardware.

Contribution

The paper presents a layer-aligned distillation method that matches hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x into 2B-parameter models for speech generation.

Findings

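01 TinyWave achieves near-teacher performance on speech tasks: within 1.4 normalized perplexity points of the teacher on Libri-Light, and 93-97% of the teacher's accuracy on spoken StoryCloze and SALMon.

02 TinyWave models outperform size-matched baselines in accuracy on spoken StoryCloze and SALMon.

03 TinyWave supports real-time speech and speech-text generation on commodity hardware.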

Abstract

Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher's performance, outperforming size-matched baselines. These models are optimized for deployment on commodity…
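The abstract names three distillation signals: hidden states, attention maps, and softened logits. The following is a minimal sketch of how such a combined objective could look, not the authors' implementation. The uniform-stride layer mapping, the loss weights, the temperature value, the dictionary layout, and the function names (`align_layers`, `distillation_loss`) are all assumptions made for illustration. It also assumes the student and teacher share hidden and attention dimensions, so no projection layer is needed.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a, b):
    """Mean squared error between two equal-length flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def align_layers(num_student_layers, num_teacher_layers):
    """Pair each student layer with a teacher layer using a uniform stride.

    Assumed scheme: student layer s is matched to the last teacher layer
    of its stride-sized block. The paper may use a different mapping.
    """
    stride = num_teacher_layers // num_student_layers
    return [(s, (s + 1) * stride - 1) for s in range(num_student_layers)]


def distillation_loss(student, teacher, layer_map, temperature=2.0,
                      w_hidden=1.0, w_attn=1.0, w_logit=1.0):
    """Weighted sum of hidden-state, attention-map, and softened-logit losses.

    `student` and `teacher` are dicts with keys "hidden" and "attn"
    (one flat list of floats per layer) and "logits" (one flat list).
    """
    hidden = sum(mse(student["hidden"][s], teacher["hidden"][t])
                 for s, t in layer_map) / len(layer_map)
    attn = sum(mse(student["attn"][s], teacher["attn"][t])
               for s, t in layer_map) / len(layer_map)
    p_teacher = softmax(teacher["logits"], temperature)
    p_student = softmax(student["logits"], temperature)
    # The T^2 factor is the usual rescaling for temperature-softened KL terms.
    logit = temperature ** 2 * kl_divergence(p_teacher, p_student)
    return w_hidden * hidden + w_attn * attn + w_logit * logit


# Toy usage: a 6-layer teacher distilled into a 2-layer student (3x fewer layers).
layer_map = align_layers(2, 6)
teacher = {
    "hidden": [[float(i)] * 4 for i in range(6)],
    "attn": [[0.25] * 4 for _ in range(6)],
    "logits": [2.0, 0.5, -1.0],
}
student = {
    "hidden": [[1.5] * 4, [4.0] * 4],
    "attn": [[0.25] * 4, [0.1, 0.2, 0.3, 0.4]],
    "logits": [1.0, 1.0, -0.5],
}
loss = distillation_loss(student, teacher, layer_map)
print(layer_map, loss)
```

In a real training loop these terms would be computed on tensors from both models and added to, or used in place of, the student's next-token loss. The relative weights would need tuning.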

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems