Scaling Speech-Text Pre-training with Synthetic Interleaved Data

Aohan Zeng; Zhengxiao Du; Mingdao Liu; Lei Zhang; Shengmin Jiang; Yuxiao Dong; Jie Tang

arXiv:2411.17607·cs.CL·December 3, 2024

TL;DR

This paper introduces a scalable method for speech-text pre-training using synthetic interleaved data derived from text corpora, enabling large-scale speech language models without the need for parallel speech-text datasets.

Contribution

It proposes a novel synthetic data generation approach and a supervised speech tokenizer, significantly enhancing speech language model scalability and performance.

Findings

1. Achieved state-of-the-art speech language modeling results with 1 trillion tokens.
2. Improved spoken question answering accuracy from 13% to 31%.
3. Developed an end-to-end spoken chatbot with competitive performance.

Abstract

Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction compared to text-based large language models (LLMs). Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, thereby limiting their scalability as LLMs. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing corresponding speech spans using a text-to-token model, bypassing the need to generate actual speech. We also employ a…
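The span-sampling idea in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_text_to_token` is a hypothetical stand-in for a trained text-to-token model, and `span_ratio`, `max_span`, and the uniform span-length sampling are placeholder choices, not the authors' settings.

```python
import random
from typing import Callable, List, Optional, Tuple

Segment = Tuple[str, List[str]]  # (modality, tokens)


def toy_text_to_token(words: List[str]) -> List[str]:
    """Hypothetical stand-in for a text-to-token model.

    Emits two fake discrete speech tokens per word; a real model would
    predict speech tokens from text without synthesizing audio.
    """
    return [f"<s{(len(w) * 7 + i) % 100}>" for w in words for i in range(2)]


def interleave(
    words: List[str],
    span_ratio: float = 0.3,   # assumed probability that a span becomes speech
    max_span: int = 4,         # assumed maximum span length in words
    rng: Optional[random.Random] = None,
    text_to_token: Callable[[List[str]], List[str]] = toy_text_to_token,
) -> List[Segment]:
    """Split a text into consecutive spans and convert some to speech tokens."""
    rng = rng or random.Random(0)
    segments: List[Segment] = []
    i = 0
    while i < len(words):
        span = rng.randint(1, max_span)
        chunk = words[i:i + span]
        if rng.random() < span_ratio:
            segments.append(("speech", text_to_token(chunk)))
        else:
            segments.append(("text", chunk))
        i += span
    return segments


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog".split()
    for modality, tokens in interleave(sample, span_ratio=0.5):
        print(modality, tokens)
```

The point of the sketch is that the pipeline only ever touches text and discrete tokens, so no waveform has to be generated to build the training sequence.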

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. The ASR-based speech tokenizer preserves semantic information while still achieving decent speech audio reproduction.
2. The low-bitrate speech tokenizer and the text-to-token model make effective use of existing large text corpora to synthesize large amounts of speech tokens, which saves the resources otherwise needed to collect large amounts of speech audio data and improves the language model's speech performance after pretraining.

Weaknesses

The weaknesses are mainly in terms of paper writing and presentation.
1. The paper mentions "we are first to use supervised semantic tokens for SpeechLMs". However, one of the baselines, Mini-Omni, also uses a Whisper-based speech tokenizer.
2. The details on how the speech and text modalities are interleaved are missing.
3. As an important part of the process, the details of the text-to-token model are missing, for example model architectures, training schemes, etc.
4. The large amounts of s

Reviewer 02 · Rating 8 · Confidence 3

Strengths

This paper is a nice contribution to the very hot topic of speech LMs. By developing an effective speech tokenizer and text-to-token model, the authors are able to create a very large speech language model that produces impressive results on a wide range of tasks. The authors perform extensive experiments and ablation studies on the speech tokenizer, speech generator (decoder), and the speech LM. The model is able to achieve strong performance on both spoken language modeling and spoken qu

Weaknesses

Although this is not necessarily a weakness, this paper seems very strong on the engineering side and a little weaker on the novelty side of things. The recipe the authors put forward consists of three separate steps: 1) tokenizer, 2) text-to-token model, 3) pretrain speech LM. While the authors build a strong tokenizer based on the Whisper model, the approach is not especially novel, as it is built on top of a strong speech recognition model. Likewise the use of a TTS corpus to learn a text-to

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Supervised speech tokenizers are a great way to distill the content from audio. Audio is high-dimensional, and using text and a low bitrate bottleneck to focus on content is a good idea, suitable for SpeechLMs.
- Training a “TTS” model to generate synthetic audio tokens is interesting - as it doesn’t require generating the final audio (high-bitrate, compute-intensive, issues with OOD synthetic data). Instead, they generate latent audio tokens that focus on content.
- The interleaving (repl

Weaknesses

- Several methodological evaluation details are missing (what was measured and how it was computed), mostly in Section 2.1 and Table 1 (see questions). Whenever you report a metric with an intuitive but inexact name (e.g., Content Preservation - LS), you should explain it more precisely somewhere (e.g., Content Preservation: We run our quantized Whisper on the LS (LibriSpeech) dataset to generate text and report the WER against the GT transcript). I understand that there’s a space limitation, but th
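The kind of precise metric definition the reviewer asks for can be made concrete with a small sketch of word error rate (WER), using the standard definition: word-level edit distance divided by the number of reference words. This is not the authors' evaluation code. The lowercasing and whitespace splitting are simplifying assumptions, and real ASR evaluations typically apply heavier text normalization.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Stating the metric at this level of detail (dataset, normalization, reference transcript) is what lets a reader reproduce a table entry such as "Content Preservation - LS".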

Code & Models

Repositories

thudm/glm-4-voice
pytorch

Models

🤗
ArtemisTAO/km
model

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques