DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training
Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, and Yali Li

TL;DR
DSCLAP is a domain-specific pre-training framework that uses contrastive learning to align raw audio with its ASR-generated transcriptions, improving multimodal understanding for voice assistants without requiring manually transcribed language-audio pairs.
Contribution
It introduces a novel pre-training method that uses raw audio and ASR transcriptions with contrastive learning, tailored for domain-specific voice assistant applications.
Findings
Outperforms baseline models on downstream tasks
Effective with only raw audio input and ASR transcriptions
Shows significant improvements in domain-specific voice assistant tasks
Abstract
Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream tasks of IVAs with pre-trained audio models and text models. However, these models are pre-trained independently, usually on tasks that differ from those of the target domains, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio requires a high level of professional skill, making joint pre-training difficult or even infeasible. To address these pain points, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signal input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a…
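The abstract is cut off before it names the training objectives, so the following is only a generic sketch of what a contrastive language-audio alignment loss commonly looks like, not the authors' implementation. It assumes a CLIP-style symmetric InfoNCE loss over a batch of paired audio and ASR-text embeddings; the function names, the temperature value of 0.07, and the toy embeddings are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed, generic objective).

    audio_emb[i] and text_emb[i] are treated as a positive pair;
    every other pairing in the batch acts as a negative.
    """
    audio = [l2_normalize(v) for v in audio_emb]
    text = [l2_normalize(v) for v in text_emb]
    n = len(audio)

    # Cosine similarity between every audio clip and every transcript,
    # scaled by the temperature.
    logits = [
        [sum(a * t for a, t in zip(audio[i], text[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Audio-to-text direction: each row should peak on its diagonal entry.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-audio direction: the same computation over columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (a2t + t2a)


# Toy batch: two audio embeddings and two ASR-text embeddings.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]      # correctly paired
mismatched_text = [[0.0, 1.0], [1.0, 0.0]]   # pairs swapped

loss_matched = symmetric_contrastive_loss(audio_batch, matched_text)
loss_mismatched = symmetric_contrastive_loss(audio_batch, mismatched_text)
```

In this toy batch the correctly paired embeddings give a loss close to zero, while the swapped pairing gives a much larger one. Minimizing a loss of this kind pulls each audio embedding toward its own transcript and away from the others in the batch.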
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Phonetics and Phonology Research
