Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning

Ilaria Manco; Justin Salamon; Oriol Nieto

arXiv:2409.11498 · cs.SD · September 19, 2024

TL;DR

This paper examines key design choices in music-text contrastive learning under limited data and compute budgets, finds that data curation matters most, and introduces two techniques that increase the diversity of text inputs seen during training at no extra cost.

Contribution

The paper measures how base encoders, data curation, and text augmentation affect the quality of music-text representations, and proposes Augmented View Dropout and TextSwap to increase the diversity of text inputs without additional computational cost or data.

Findings

01 Data curation is the single most important factor for music-text contrastive training in resource-constrained settings.

02 Augmented View Dropout and TextSwap boost performance across models and datasets.

03 Neither technique increases computational cost or data requirements.

Abstract

Audio-text contrastive models have become a powerful approach in music representation learning. Despite their empirical success, however, little is known about the influence of key design choices on the quality of music-text representations learnt through this framework. In this work, we expose these design choices within the constraints of limited data and computation budgets, and establish a more solid understanding of their impact grounded in empirical observations along three axes: the choice of base encoders, the level of curation in training data, and the use of text augmentation. We find that data curation is the single most important factor for music-text contrastive training in resource-constrained scenarios. Motivated by this insight, we introduce two novel techniques, Augmented View Dropout and TextSwap, which increase the diversity and descriptiveness of text inputs seen in…
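For readers unfamiliar with the framework the abstract refers to, the following is a minimal sketch of a symmetric audio-text contrastive objective in the common CLIP-style form. It is a generic illustration, not the authors' implementation: the function names, the NumPy formulation, and the temperature value of 0.07 are assumptions, and the paper's actual loss, encoders, and hyperparameters may differ.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic CLIP-style loss (assumed form, not taken from the paper).

    audio_emb, text_emb: arrays of shape (batch, dim); row i of each
    array is assumed to come from the same music-caption pair.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarities scaled by the temperature.
    logits = a @ t.T / temperature
    idx = np.arange(len(a))

    # Audio-to-text: each audio clip should pick out its own caption.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: each caption should pick out its own audio clip.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)
```

In this formulation every other caption in the batch acts as a negative for a given audio clip, which is one reason the diversity and descriptiveness of the text inputs can affect what the model learns.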

Taxonomy

Topics: Music and Audio Processing · Natural Language Processing Techniques

Methods: Dropout · Balanced Selection