Improving Query-by-Vocal Imitation with Contrastive Learning and Audio Pretraining

Jonathan Greif; Florian Schmid; Paul Primus; Gerhard Widmer

arXiv:2408.11638·eess.AS·August 22, 2024

TL;DR

This paper introduces a Query-by-Vocal Imitation (QBV) system that combines pre-trained CNN audio models with contrastive learning to improve audio retrieval accuracy, achieving state-of-the-art results on vocal imitation search tasks.

Contribution

The paper presents an end-to-end fine-tuning approach that applies contrastive learning to pre-trained audio models to improve QBV performance.

Findings

01. Significant performance improvements over previous methods.

02. Achieves state-of-the-art results on QBV benchmarks.

03. Effective use of contrastive learning with pre-trained models.

Abstract

Query-by-Vocal Imitation (QBV) is the task of searching audio files in a database using vocal imitations produced by the user. Since most humans can effectively communicate sound concepts through voice, QBV offers a more intuitive and convenient approach than text-based search. To fully leverage QBV, it is crucial to develop robust audio feature representations for both the vocal imitation and the original sound. In this paper, we present a new system for QBV that utilizes the feature extraction capabilities of Convolutional Neural Networks pre-trained on large-scale general-purpose audio datasets. We integrate these pre-trained models into a dual encoder architecture and fine-tune them end-to-end using contrastive learning. A distinctive aspect of our proposed method is the fine-tuning strategy of pre-trained models using an adapted NT-Xent loss for contrastive learning,…
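
To make the contrastive objective more concrete, the following is a minimal NumPy sketch of a generic cross-modal NT-Xent loss for a dual encoder. It is not the authors' implementation: the abstract is truncated before it describes how the loss is adapted, so the symmetric formulation, the function name `cross_modal_nt_xent`, the default temperature of 0.1, and the assumption that row k of each embedding matrix forms a matching imitation/reference pair are all illustrative choices.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def cross_modal_nt_xent(imitation_emb, reference_emb, temperature=0.1):
    """Symmetric NT-Xent loss over a batch of paired embeddings.

    Assumption (not from the paper): row k of `imitation_emb` and row k of
    `reference_emb` are a positive pair; every other row in the batch acts
    as a negative.
    """
    z_i = l2_normalize(np.asarray(imitation_emb, dtype=np.float64))
    z_r = l2_normalize(np.asarray(reference_emb, dtype=np.float64))

    # Temperature-scaled cosine similarity between every imitation and reference.
    logits = z_i @ z_r.T / temperature
    diag = np.arange(logits.shape[0])

    # Imitation -> reference: each row should pick out its own column.
    loss_i2r = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Reference -> imitation: each column should pick out its own row.
    loss_r2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2r + loss_r2i)


# Toy batch: four orthogonal "embeddings" standing in for encoder outputs.
emb = np.eye(4)
loss_aligned = cross_modal_nt_xent(emb, emb)
loss_shuffled = cross_modal_nt_xent(emb, np.roll(emb, 1, axis=0))
```

In this toy batch, correctly paired embeddings yield a loss near zero, while pairing each imitation with the wrong reference yields a much larger loss. In a real system the two inputs would come from the imitation and reference encoders, and the loss would be minimized with a gradient-based framework such as PyTorch.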

Code & Models

Repositories

Jonathan-Greif/QBV (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing

Methods: Normalized Temperature-scaled Cross Entropy Loss