TL;DR
This paper investigates how data processing and curation strategies impact speech-language model pretraining, demonstrating that careful data management can significantly enhance performance even with smaller models.
Contribution
It provides a systematic data-centric analysis for SpeechLM pretraining, introducing effective data processing, augmentation, and sequencing techniques that improve model performance.
Findings
Data curation significantly boosts SpeechLM performance.
Synthetic data augmentation enhances model capabilities.
Proper data sequencing improves training efficiency.
Abstract
Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs. We focus on three research questions fundamental to speech-language pretraining data: (1) how to process raw web-crawled audio content for speech-text pretraining, (2) how to construct synthetic pretraining datasets to augment web-crawled data and (3) how to interleave (text, audio) segments into training sequences. We…
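Research question (3) above, how to interleave (text, audio) segments into training sequences, can be made concrete with a small sketch. This is not the authors' pipeline: the `Segment` type, the `interleave` function, and the `p_audio` parameter are hypothetical names introduced only for illustration. It assumes each segment carries both a transcript and a list of audio token IDs for the same span of speech, and that one modality is chosen per segment.

```python
import random
from typing import List, Tuple, Union

# Hypothetical type: a (text, audio_tokens) pair covering the same span of speech.
Segment = Tuple[str, List[int]]
Item = Tuple[str, Union[str, List[int]]]


def interleave(
    segments: List[Segment],
    p_audio: float = 0.5,
    stochastic: bool = False,
    seed: int = 0,
) -> List[Item]:
    """Build one training sequence, keeping a single modality per segment.

    Assumed schemes (illustrative only):
    - deterministic: alternate text, audio, text, ...
    - stochastic: pick audio with probability p_audio for each segment.
    """
    rng = random.Random(seed)
    sequence: List[Item] = []
    for index, (text, audio_tokens) in enumerate(segments):
        if stochastic:
            use_audio = rng.random() < p_audio
        else:
            use_audio = index % 2 == 1
        if use_audio:
            sequence.append(("audio", audio_tokens))
        else:
            sequence.append(("text", text))
    return sequence


if __name__ == "__main__":
    demo = [("hello", [1, 2]), ("world", [3]), ("again", [4, 5, 6])]
    print(interleave(demo))
    print(interleave(demo, stochastic=True, seed=1))
```

Under these assumptions, interleaving granularity is controlled by how the input is chunked before calling `interleave`: shorter segments yield more frequent modality switches within a sequence.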
Peer Reviews
Decision: ICLR 2026, Conditional Poster
- The work runs well-designed low-level ablations that lead to clear suggestions about the design space of Speech-Language Pretraining. This type of non-glamorous but important study seems extremely likely to be valuable to other practitioners in the space and enables the authors to train a strong model themselves!
- The paper goes above and beyond most works of any form in terms of experimental rigor, including running contamination analysis.
- The synthetic data study in addition to the domain
- The work primarily focuses on evaluations in which the model must generate text, but does not evaluate how these decisions impact the model's ability to generate speech in either S->S settings or in TTS usage.
- The work's evaluations of the whole system do not compare to the simplest baseline of pipelining the text-init with a common ASR system such as Whisper.
- For a data-centric work, the work doesn't actually provide much in the way of details of what the original source 10M hours of audi
Quality:
1. The paper presents a well-controlled ablation study; one example is the study of chunking and interleaving granularity and of deterministic vs. stochastic sampling schemes.
2. The paper presents data-driven diagnostics and analysis, for example: modality alignment analysis by KL divergence; topic-coverage analysis; contamination checks with n-gram matching. Such analysis provides deep insights that go beyond the benchmark comparison.
Clarity: The paper employs proper diagram to
1. Task scope. This paper limits its evaluation to Spoken QA (plus text understanding) while positioning the goal of the proposed method as optimizing speech-text pretraining. There could be doubt whether SQA alone is representative. The authors need to demonstrate a correlation between SQA performance and speech-text pretraining quality. The authors discuss this in Addendum K:
> One caveat preventing us from a direct comparison on such tasks is that we do not employ any task-spe
The article is clearly structured and easy to follow. It provides an analysis of the construction details of speech data in large speech models, a topic that has not been extensively explored in existing works.
1. Overall, this work remains largely analytical, and the answers to several questions are relatively straightforward. For instance, the conclusion that "fine-grained interleaved tokens are better than coarse-grained" is not particularly surprising, as mainstream approaches in speech-to-speech LLMs already employ word-level interleaved strategies, which are inherently fine-grained. Therefore, this finding does not significantly impact the development of large speech models, and its contribution
