Simultaneously Learning Robust Audio Embeddings and balanced Hash codes for Query-by-Example
Anup Singh, Kris Demuynck, Vipul Arora

TL;DR
This paper introduces a self-supervised learning framework that simultaneously generates robust audio embeddings and balanced hash codes, improving retrieval speed and accuracy in large-scale audio fingerprinting systems.
Contribution
It proposes a novel end-to-end approach that casts hash-code generation as a balanced clustering problem, treated as an instance of optimal transport, and reports improved retrieval performance over existing indexing methods.
Findings
Improved retrieval efficiency at high distortion levels.
High accuracy maintained with balanced hash codes.
System is scalable in computation and memory.
Abstract
Audio fingerprinting systems must efficiently and robustly identify query snippets in an extensive database. To this end, state-of-the-art systems use deep learning to generate compact audio fingerprints. These systems deploy indexing methods, which quantize fingerprints to hash codes in an unsupervised manner to expedite the search. However, these methods generate imbalanced hash codes, leading to their suboptimal performance. Therefore, we propose a self-supervised learning framework to compute fingerprints and balanced hash codes in an end-to-end manner to achieve both fast and accurate retrieval performance. We model hash codes as a balanced clustering process, which we regard as an instance of the optimal transport problem. Experimental results indicate that the proposed approach improves retrieval efficiency while preserving high accuracy, particularly at high distortion levels,…
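To make the "balanced clustering as optimal transport" idea concrete, here is a minimal sketch of one common way to solve such a problem: Sinkhorn-style alternating row and column scaling. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not name a solver, and the function names, the `epsilon` temperature, the iteration count, and the `scores` input (similarity of each fingerprint to each hash bucket) are all hypothetical.

```python
import math


def balanced_soft_assignment(scores, epsilon=0.1, n_iters=50):
    """Softly assign n items to k buckets so every bucket gets about n/k mass.

    scores: n x k nested list; scores[i][j] is the (assumed) similarity
    between item i and bucket j. Returns an n x k nested list whose rows
    sum to 1 and whose columns sum to approximately n / k.
    """
    n, k = len(scores), len(scores[0])

    # Gibbs kernel exp(score / epsilon). Subtracting each row's max keeps
    # exp() numerically stable; the per-row shift is absorbed by the row
    # scaling below, so it does not change the result.
    q = []
    for row in scores:
        row_max = max(row)
        q.append([math.exp((s - row_max) / epsilon) for s in row])

    for _ in range(n_iters):
        # Column step: rescale so each bucket holds total mass n / k.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[row[j] * (n / k) / col_sums[j] for j in range(k)] for row in q]
        # Row step: rescale so each item distributes total mass 1.
        q = [[v / sum(row) for v in row] for row in q]
    return q


def bucket_counts(matrix):
    """Count how many rows pick each column as their argmax (hard assignment)."""
    k = len(matrix[0])
    counts = [0] * k
    for row in matrix:
        counts[max(range(k), key=row.__getitem__)] += 1
    return counts
```

Calling `bucket_counts(scores)` on the raw scores and `bucket_counts(balanced_soft_assignment(scores))` on the balanced assignment shows the effect: when one bucket systematically scores higher, the raw argmax crowds most items into it, while the balanced assignment spreads items far more evenly. Even bucket sizes matter for retrieval because a lookup scans every item in the matched bucket, so oversized buckets dominate search time.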
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Advanced Image and Video Retrieval Techniques
