CoLLAP: Contrastive Long-form Language-Audio Pretraining with Musical Temporal Structure Augmentation

Junda Wu; Warren Li; Zachary Novack; Amit Namburi; Carol Chen; Julian McAuley

arXiv:2410.02271·cs.SD·October 4, 2024

Open Access

TL;DR

CoLLAP introduces a novel contrastive pretraining method that extends the perception window for long-form audio and text, leveraging musical temporal structures to improve multimodal alignment and retrieval tasks.

Contribution

The paper presents a new contrastive learning architecture for long-form music-audio and text, utilizing musical temporal structures and large-scale data to enhance multimodal representation learning.

Findings

01. Improved retrieval accuracy on long-form music-text datasets.

02. Effective transfer of pretrained models to diverse music information retrieval tasks.

03. Demonstrated benefits of temporal structure augmentation in multimodal contrastive learning.

Abstract

Modeling temporal characteristics plays a significant role in the representation learning of audio waveforms. We propose Contrastive Long-form Language-Audio Pretraining (CoLLAP) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words), while enabling contrastive learning across modalities and temporal dynamics. Leveraging recent Music-LLMs to generate long-form music captions for full-length songs, augmented with musical temporal structures, we collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds. We propose a novel contrastive learning architecture that fuses language representations with structured audio representations by segmenting each song into clips and extracting their embeddings. With an attention…
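The clip segmentation and cross-modal contrastive objective described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the paper's implementation: the function names (`segment`, `song_embedding`, `symmetric_contrastive_loss`), the random linear map standing in for a clip encoder, the temperature of 0.07, and the toy sample rate are all hypothetical. The abstract is cut off before it describes how clip embeddings are aggregated, so plain mean pooling is used as a stand-in, and the loss is a generic CLIP-style symmetric InfoNCE.

```python
import numpy as np

def segment(waveform, clip_len):
    """Split a 1-D waveform into non-overlapping clips, zero-padding the last one."""
    n_clips = -(-len(waveform) // clip_len)  # ceiling division
    padded = np.zeros(n_clips * clip_len, dtype=waveform.dtype)
    padded[:len(waveform)] = waveform
    return padded.reshape(n_clips, clip_len)

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def song_embedding(clip_embeddings):
    """Mean-pool clip embeddings into one track-level vector.
    Assumption: a stand-in for the aggregation step the abstract truncates."""
    return l2_normalize(clip_embeddings.mean(axis=0))

def log_softmax(logits, axis):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE: row i of audio_emb is paired with row i of text_emb."""
    logits = l2_normalize(audio_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(len(logits))
    audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (audio_to_text + text_to_audio) / 2

# Toy usage: four 288-second "songs" at a hypothetical 100 Hz sample rate.
rng = np.random.default_rng(0)
sample_rate, clip_seconds = 100, 10
clip_len = sample_rate * clip_seconds
clip_encoder = rng.standard_normal((clip_len, 16))  # hypothetical encoder: random linear map
songs = [rng.standard_normal(sample_rate * 288) for _ in range(4)]
audio_emb = np.stack([song_embedding(segment(s, clip_len) @ clip_encoder) for s in songs])
text_emb = audio_emb + 0.01 * rng.standard_normal(audio_emb.shape)  # fake caption embeddings
loss = symmetric_contrastive_loss(audio_emb, text_emb)
```

In this sketch a 288-second track yields 29 ten-second clips, with the last one zero-padded. Swapping `song_embedding` for a learned, text-aware weighting over clips would be the natural next step, but the details of that step are not recoverable from the truncated abstract.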

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies

Methods: Softmax · Attention Is All You Need · Contrastive Learning