DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training

Shengqiang Liu; Da Liu; Anna Wang; Zhiyu Zhang; Jie Gao; and Yali Li

arXiv:2409.09289·cs.SD·September 17, 2024

TL;DR

DSCLAP is a domain-specific pre-training framework that aligns raw audio with transcriptions using contrastive learning, improving multimodal understanding for voice assistants without requiring domain-specific labeled data.

Contribution

It introduces a novel pre-training method that uses raw audio and ASR transcriptions with contrastive learning, tailored for domain-specific voice assistant applications.
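The contrastive alignment described above can be illustrated with a generic symmetric contrastive (InfoNCE-style) loss of the kind used in CLIP/CLAP-like models. This is a minimal sketch under stated assumptions, not DSCLAP's confirmed objective: the summary here does not give the exact loss, and the function names, the temperature value of 0.07, and the plain-list embeddings are all hypothetical choices made for illustration. Each audio embedding is treated as matching the embedding of its own ASR transcription (same index in the batch), with every other transcription in the batch acting as a negative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Generic audio-text contrastive loss (assumed form, not from the paper).

    audio_embs[i] and text_embs[i] are assumed to come from the same
    utterance: the audio encoder output and the encoded ASR transcription.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in text_embs]
    # Cosine similarity between every audio/text pair, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        for a in audio
    ]
    text_to_audio = [list(col) for col in zip(*logits)]
    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(text_to_audio)
    )


if __name__ == "__main__":
    audio_batch = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = matched_text[::-1]
    print("matched:", symmetric_contrastive_loss(audio_batch, matched_text))
    print("swapped:", symmetric_contrastive_loss(audio_batch, swapped_text))
```

With the toy batch in the `__main__` block, the matched pairing yields a much lower loss than the swapped pairing, which is the behavior a contrastive objective is meant to reward. In a real setup the embeddings would come from trainable audio and text encoders rather than fixed lists.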

Findings

01. Outperforms baseline models on downstream tasks

02. Effective with only raw audio input and ASR transcriptions

03. Shows significant improvements in domain-specific voice assistant tasks

Abstract

Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream IVA tasks using pre-trained audio models and text models. However, these models are pre-trained independently, and usually on tasks that differ from the target domains, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio requires considerable professional expertise, making joint pre-training difficult or even infeasible. To address these pain points, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signal input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a…

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Phonetics and Phonology Research