Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation
Wei-Cheng Tseng, Xuanru Zhou, Mingyue Huo, Yiwen Shao, Hao Zhang, Dong Yu

TL;DR
This paper introduces a large-scale audio-text dataset and systematically evaluates contrastive and captioning objectives for audio-language pretraining, demonstrating their effectiveness in learning general-purpose audio representations across diverse tasks.
Contribution
It provides the first comprehensive evaluation of audio-language pretraining methods, introduces the CaptionStew dataset, and offers insights into objective strengths and scaling behaviors.
Findings
Contrastive learning is more data-efficient at small scales.
Captioning excels at language-involved audio tasks with more data.
Supervised initialization offers limited benefits at scale.
Abstract
Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-language models like CLIP serve as widely adopted foundations, existing audio-language models primarily excel at retrieval tasks with limited adoption as general-purpose encoders. We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and lack of systematic exploration and evaluation. To this end, we introduce CaptionStew, a 10.7M caption dataset aggregating diverse open-source audio-text corpora across multiple domains and captioning styles. Using this resource, we conduct the first comprehensive evaluation comparing contrastive and captioning objectives for audio representation learning across speech, music, and environmental sound tasks. Our results demonstrate that audio-language…
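To make the two objectives named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a CLIP-style symmetric contrastive loss over paired audio and text embeddings, and a captioning loss computed as the mean next-token cross-entropy. The function names, the temperature of 0.07, and the toy inputs are illustrative assumptions; none of them come from the paper.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is a matched pair.

    The temperature value is an assumption chosen for illustration.
    """
    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    n = len(a)
    # Cosine similarity between every audio clip and every caption.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text: each row should put its mass on the diagonal entry.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio: same check over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


def captioning_loss(token_logits, target_ids):
    """Mean cross-entropy of the target caption tokens.

    token_logits[k] holds hypothetical decoder logits over the vocabulary
    at step k; target_ids[k] is the index of the reference token.
    """
    losses = [-log_softmax(row)[tid] for row, tid in zip(token_logits, target_ids)]
    return sum(losses) / len(losses)


# Toy example: two perfectly aligned pairs versus the same pairs swapped.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the aligned batch yields a much smaller loss than the swapped one, which is the behavior the contrastive objective rewards. The captioning loss instead scores each caption token individually, so it draws supervision from every word of a caption rather than from one pair-level similarity.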
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. It's useful to see how model performance scales with the dataset size.
1. The writing omits many details: What do the settings “PSC” and “MC” mean in Table 3.b? What do “AEC,” “SID,” “SER,” “MTAG,” “INST” and “SED” mean in Table 3.a? Please do not assume readers are already familiar with these acronyms. 2. The paper's experimental setting is unclear. It shows a lot of settings, but the exact setting of the method is not mentioned. Please do not assume readers can understand these settings without explanation. For example, the authors list results
The paper demonstrates several notable strengths, particularly through its systematic approach to addressing existing limitations and its comprehensive empirical validation of audio-language pretraining (ALP) as a pathway to developing general-purpose audio representations. One of its most significant contributions is the introduction of the CaptionStew (CS10M) dataset, a large-scale and diverse resource specifically designed to overcome the scarcity of extensive audio-text corpora that has long
A major contribution of the paper, the CaptionStew (CS10M) dataset, also has a few limitations. While its scale and diversity mark a substantial advance for audio-language pretraining, the authors recognize several issues inherent in its design. First, CaptionStew’s reliance on LLM-synthesized and augmented captions introduces potential biases and artifacts, as no extensive human verification or quality control was conducted across the dataset. This could affect the fidelity and reliabilit
The experimental setup is a major strength of this work. The evaluation is extensive, covering linear probing, audio-language tasks, and open-form QA. The controlled experiments on key variables, namely pretraining objectives (contrastive vs. captioning), data scaling (from 400K to 10M pairs), and initialization (from scratch vs. supervised pretraining), are exceptionally thorough and provide robust, convincing evidence for the claims. The paper offers several key findings that challenge common practice
The method for creating the core contribution, CaptionStew, is arguably too simplistic. While aggregating existing datasets is a low-cost solution to the data scarcity problem, it inherits the potential quality issues of the source datasets (e.g., LLM-synthesized captions in WavCaps and AudioSetCaps). The authors acknowledge the limited linguistic diversity in the analysis, but the approach feels unrefined and may limit the upper bound of what the pretrained models can learn.
1. The systematic and comprehensive evaluation is a major strength of the paper. The authors go beyond the standard audio-text retrieval benchmark and evaluate representations across a wide suite of tasks (audio event classification, speaker ID, emotion recognition, music tagging, sound event detection, QA) and protocols (linear probing, retrieval, captioning, LLM-based QA). This provides a holistic and convincing view of the models' capabilities and transferability. 2. The paper provides the fi
1. The work's foundational elements lack significant innovation. The CaptionStew dataset, while large-scale and practically useful, is an aggregation of existing public datasets without a novel data creation or curation methodology. Similarly, the two primary technical insights—that contrastive learning is highly data-efficient and that generative (captioning) objectives scale well with more data—are well-established principles in machine learning, particularly in vision and language domains. Th
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
