Scaling Speech-Text Pre-training with Synthetic Interleaved Data
Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang

TL;DR
This paper introduces a scalable method for speech-text pre-training using synthetic interleaved data derived from text corpora, enabling large-scale speech language models without the need for parallel speech-text datasets.
Contribution
It proposes a novel synthetic data generation approach and a supervised speech tokenizer, significantly enhancing speech language model scalability and performance.
Findings
Achieved state-of-the-art speech language modeling results by scaling pre-training to 1 trillion tokens.
Improved spoken question answering accuracy from the previous state of the art of 13% to 31%.
Developed an end-to-end spoken chatbot with competitive performance.
Abstract
Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction compared to text-based large language models (LLMs). Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, thereby limiting their scalability as LLMs. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing corresponding speech spans using a text-to-token model, bypassing the need to generate actual speech. We also employ a…
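The interleaving idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `text_to_token_stub` is a hypothetical stand-in for a trained text-to-token model, and the vocabulary size, tokens per word, span length, and span-start probability are all assumed values chosen only for the example.

```python
import random

SPEECH_VOCAB_SIZE = 1024  # assumed size of the discrete speech-token vocabulary


def text_to_token_stub(words, tokens_per_word=3):
    """Hypothetical stand-in for a text-to-token model.

    A real model would predict discrete speech tokens from text; here each
    word is mapped deterministically to a few placeholder token strings.
    """
    tokens = []
    for word in words:
        base = sum(ord(ch) for ch in word)
        for k in range(tokens_per_word):
            tokens.append(f"<s{(base + k) % SPEECH_VOCAB_SIZE}>")
    return tokens


def interleave(text, span_start_prob=0.3, max_span_len=4, seed=0):
    """Replace randomly sampled word spans with synthetic speech tokens.

    span_start_prob is the (assumed) chance that a speech span starts at the
    current word; max_span_len caps the span length in words.
    """
    rng = random.Random(seed)
    words = text.split()
    sequence = []
    i = 0
    while i < len(words):
        if rng.random() < span_start_prob:
            span = words[i:i + rng.randint(1, max_span_len)]
            sequence.extend(text_to_token_stub(span))  # speech span
            i += len(span)
        else:
            sequence.append(words[i])  # text stays as text
            i += 1
    return sequence


if __name__ == "__main__":
    print(interleave("speech language models accept speech input"))
```

No audio is synthesized at any point in this sketch: the speech spans exist only as discrete tokens, which is the property the abstract highlights.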
Peer Reviews
Decision: ICLR 2025 Poster
1. The ASR-based speech tokenizer preserves semantic information and reproduces speech audio reasonably well at the same time.
2. The low-bitrate speech tokenizer and the text-to-token model make effective use of existing large text corpora to synthesize large amounts of speech tokens. This saves the resources otherwise needed to collect large amounts of speech audio and improves the language model's speech performance after pre-training.
The weaknesses are mainly in terms of paper writing and presentation.
1. The paper mentions "we are first to use supervised semantic tokens for SpeechLMs". However, one of the baselines, Mini-Omni, also uses a Whisper-based speech tokenizer.
2. The details on how the speech and text modalities are interleaved are missing.
3. As an important part of the process, the details of the text-to-token model are missing, for example the model architecture and training scheme.
4. The large amounts of s
This paper is a nice contribution to the very hot topic of speech LMs. By developing an effective speech tokenizer and text-to-token model, the authors are able to create a very large speech language model that produces impressive results on a wide range of tasks. The authors perform extensive experiments and ablation studies on the speech tokenizer, speech generator (decoder), and the speech LM. The model is able to achieve strong performance on both spoken language modeling and spoken qu
Although this is not necessarily a weakness, this paper seems very strong on the engineering side and a little weaker on the novelty side. The recipe the authors put forward consists of three separate steps: 1) tokenizer, 2) text-to-token model, 3) speech LM pre-training. While the authors build a strong tokenizer based on the Whisper model, the approach is not especially novel, as it is built on top of a strong speech recognition model. Likewise the use of a TTS corpus to learn a text-to
- Supervised speech tokenizers are a great way to distill the content from audio. Audio is high-dimensional, and using text and a low-bitrate bottleneck to focus on content is a good idea, suitable for SpeechLMs.
- Training a “TTS” model to generate synthetic audio tokens is interesting, as it doesn’t require generating the final audio (high-bitrate, compute-intensive, issues with OOD synthetic data). Instead, they generate latent audio tokens that focus on content.
- The interleaving (repl
- Several methodological evaluation details are missing (what was measured and how it was computed), mostly in Section 2.1 and Table 1 (see questions). Whenever you report a metric with an intuitive but inexact name (e.g., Content Preservation - LS), you should explain it more precisely somewhere (e.g., Content Preservation: We run our quantized Whisper on the LS (LibriSpeech) dataset to generate text and report the WER against the GT transcript). I understand that there’s a space limitation, but th
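For reference, the WER that the reviewer suggests reporting is conventionally the word-level edit distance between hypothesis and ground-truth transcript, divided by the number of reference words. The sketch below is a generic illustration of that definition, not the paper's evaluation code; `word_error_rate` is a hypothetical helper, and it assumes both strings are already normalized (case, punctuation) and whitespace-tokenized.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Assumes both inputs are already text-normalized; splitting is on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In the usage line, one substitution ("sat" to "sit") and one deletion ("the") against six reference words give a WER of 2/6.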
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
