Attention-Based Audio Embeddings for Query-by-Example

Anup Singh; Kris Demuynck; Vipul Arora

arXiv:2210.08624 · eess.AS · November 22, 2024

TL;DR

This paper introduces an audio retrieval system that combines contrastive learning with a spectral-temporal attention mechanism to produce fingerprints robust to noise and reverberation, improving retrieval accuracy at high distortion levels while remaining efficient and scalable.

Contribution

The system trains a CNN with channel-wise spectral-temporal attention under a contrastive learning framework to generate audio fingerprints that are robust to noise and reverberation, which improves query matching.
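The contrastive objective mentioned above can be illustrated with a minimal sketch. It assumes a batch of paired embeddings in which row i of one matrix comes from a clean segment and row i of the other from its distorted counterpart. The InfoNCE-style loss, the `temperature` value, and the function names are illustrative assumptions, not the loss or settings the paper reports.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def contrastive_loss(clean_emb, distorted_emb, temperature=0.1):
    """InfoNCE-style loss over a batch of (clean, distorted) embedding pairs.

    Row i of clean_emb and row i of distorted_emb form the positive pair;
    every other row in the batch acts as a negative. The temperature is a
    hypothetical default, not a value taken from the paper.
    """
    a = l2_normalize(np.asarray(clean_emb, dtype=float))
    b = l2_normalize(np.asarray(distorted_emb, dtype=float))
    logits = a @ b.T / temperature                      # shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability assigned to each true pair.
    return float(-np.mean(np.diag(log_prob)))
```

Minimizing this quantity pushes each clean embedding toward its own distorted version and away from the other items in the batch, which is the general behavior the contrastive framework relies on.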

Findings

01. Outperforms state-of-the-art systems at high distortion levels
02. Efficient in computation and memory usage
03. Scalable to larger databases

Abstract

An ideal audio retrieval system efficiently and robustly recognizes a short query snippet from an extensive database. However, the performance of well-known audio fingerprinting systems falls short at high signal distortion levels. This paper presents an audio retrieval system that generates noise- and reverberation-robust audio fingerprints using the contrastive learning framework. Using these fingerprints, the method performs a comprehensive search to identify the query audio and precisely estimate its timestamp in the reference audio. Our framework involves training a CNN to maximize the similarity between pairs of embeddings extracted from clean audio and its corresponding distorted and time-shifted version. We employ a channel-wise spectral-temporal attention mechanism to better discriminate the audio by giving more weight to the salient spectral-temporal patches in the signal.…
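The search and timestamp estimation step described in the abstract can be sketched as an exhaustive sliding comparison. The sketch assumes each audio item is represented as a sequence of L2-normalized fingerprint vectors, one per short segment, and that the query is no longer than the reference. The function names, the mean-cosine scoring rule, and the `hop_seconds` parameter are hypothetical choices for illustration and may differ from the paper's actual search procedure.

```python
import numpy as np

def locate_query(reference_fp, query_fp):
    """Slide the query over the reference; return (best_offset, best_score).

    reference_fp: array of shape (R, D), one unit-norm fingerprint per segment.
    query_fp:     array of shape (Q, D) with Q <= R.
    The score of an offset is the mean cosine similarity of aligned segments.
    """
    reference_fp = np.asarray(reference_fp, dtype=float)
    query_fp = np.asarray(query_fp, dtype=float)
    n_ref, n_query = len(reference_fp), len(query_fp)
    if n_query > n_ref:
        raise ValueError("query must not be longer than the reference")
    sims = query_fp @ reference_fp.T        # (Q, R) pairwise cosine similarities
    rows = np.arange(n_query)
    scores = np.array([
        sims[rows, offset + rows].mean()    # query segment j vs reference j+offset
        for offset in range(n_ref - n_query + 1)
    ])
    best = int(np.argmax(scores))
    return best, float(scores[best])

def offset_to_seconds(offset, hop_seconds):
    """Convert a segment offset to a timestamp; hop_seconds is an assumed hop size."""
    return offset * hop_seconds
```

A brute-force scan like this costs time proportional to the reference length for every query, so a system serving a large database would normally pair the fingerprints with an index rather than compare against every offset.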

Code & Models

Repositories

magcil/deep-audio-fingerprinting-benchmark (PyTorch)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Contrastive Learning