Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification

R. Oguz Araz; Guillem Cortès-Sebastià; Emilio Molina; Joan Serrà; Xavier Serra; Yuki Mitsufuji; Dmitry Bogdanov

arXiv:2506.22661·cs.SD·July 1, 2025

Open Access

TL;DR

This paper improves neural audio fingerprinting robustness to real-world audio degradations by proposing best practices, systematically evaluating metric learning methods, and demonstrating state-of-the-art results with a self-supervised triplet loss approach.

Contribution

The paper introduces best practices for self-supervised training, systematically compares metric learning approaches, and achieves state-of-the-art performance in music identification under degraded conditions.

Findings

1. Self-supervised triplet loss outperforms other metric learning methods.

2. Training with multiple positives has different effects depending on the loss function.

3. Proposed approach achieves state-of-the-art results on degraded and real-world datasets.

Abstract

Audio fingerprinting (AFP) allows the identification of unknown audio content by extracting compact representations, termed audio fingerprints, that are designed to remain robust against common audio degradations. Neural AFP methods often employ metric learning, where representation quality is influenced by the nature of the supervision and the utilized loss function. However, recent work unrealistically simulates real-life audio degradation during training, resulting in sub-optimal supervision. Additionally, although several modern metric learning approaches have been proposed, current neural AFP methods continue to rely on the NT-Xent loss without exploring the recent advances or classical alternatives. In this work, we propose a series of best practices to enhance the self-supervision by leveraging musical signal properties and realistic room acoustics. We then present the first…
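To make the contrast between the two loss families named above concrete, here is a minimal NumPy sketch of a batch triplet loss and of NT-Xent, each computed on paired embeddings of clean and degraded audio segments. It is an illustration under stated assumptions, not the authors' implementation: the function names, the margin of 0.5, the temperature of 0.05, the use of cosine similarity, and the hardest-in-batch negative selection are all choices made here for the example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def triplet_loss(anchors, positives, margin=0.5):
    """Triplet loss with the hardest in-batch negative per anchor.

    anchors[i] and positives[i] are assumed to be embeddings of the same
    audio segment (e.g. clean and degraded). The margin and the mining
    strategy are illustrative assumptions. Needs a batch of at least 2.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    sim = a @ p.T                       # (B, B) cosine similarities
    pos_sim = np.diag(sim)              # similarity to the true positive
    neg = sim.copy()
    np.fill_diagonal(neg, -np.inf)      # exclude the positive pair
    hardest_neg = neg.max(axis=1)       # most similar non-matching item
    return float(np.mean(np.maximum(0.0, margin - pos_sim + hardest_neg)))


def nt_xent_loss(anchors, positives, temperature=0.05):
    """NT-Xent: softmax cross-entropy over all other items in the batch.

    The temperature value is an illustrative assumption.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    z = np.concatenate([a, p], axis=0)  # (2B, D)
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)      # an item is never its own candidate
    b = a.shape[0]
    # Row i in the first half matches row i + b in the second half, and back.
    targets = np.concatenate([np.arange(b, 2 * b), np.arange(0, b)])
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * b), targets] - log_norm
    return float(-log_prob.mean())


# Toy check: perfectly aligned pairs versus deliberately mismatched pairs.
emb = np.eye(3)
mismatched = np.roll(emb, 1, axis=0)
aligned_triplet = triplet_loss(emb, emb)
mismatched_triplet = triplet_loss(emb, mismatched)
aligned_ntxent = nt_xent_loss(emb, emb)
mismatched_ntxent = nt_xent_loss(emb, mismatched)
```

Both losses are low when each degraded segment lands next to its clean counterpart and high when it lands next to a different segment. The structural difference is that the triplet loss only penalizes violations of a fixed margin, while NT-Xent normalizes over every other item in the batch.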

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies