Self-Supervised Multi-View Learning for Disentangled Music Audio Representations

Julia Wilkins; Sivan Ding; Magdalena Fuentes; Juan Pablo Bello

arXiv:2411.02711·cs.SD·November 6, 2024

Open Access

TL;DR

This paper introduces a self-supervised multi-view learning framework for music audio that encourages separation between shared and private representation spaces, so that view-specific information is retained rather than lost, all without labeled data.

Contribution

The paper presents a novel SSL method that explicitly separates shared and private audio features, addressing the entangled representations produced by previous approaches.

Findings

01 Effective disentanglement of shared and private representations demonstrated

02 Improved robustness and generalization in music audio tasks shown

03 Case study confirms the method's effectiveness in controlled settings

Abstract

Self-supervised learning (SSL) offers a powerful way to learn robust, generalizable representations without labeled data. In music, where labeled data is scarce, existing SSL methods typically use generated supervision and multi-view redundancy to create pretext tasks. However, these approaches often produce entangled representations and lose view-specific information. We propose a novel self-supervised multi-view learning framework for audio designed to incentivize separation between private and shared representation spaces. A case study on audio disentanglement in a controlled setting demonstrates the effectiveness of our method.
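
The abstract does not spell out the architecture or the training objective, so the following is only a toy illustration of what "separating private and shared representation spaces" can mean in practice, not the authors' method. It assumes each view has already been encoded into a latent vector, that the first `n_shared` dimensions are designated as shared and the rest as private, and that the objective pulls the shared parts of two views together while a hinge term keeps the private parts at least `margin` apart. The function names, the split convention, and the loss terms are all hypothetical.

```python
def split_latent(z, n_shared):
    """Split a latent vector into (shared, private) parts.

    Assumption: the first n_shared dimensions are the shared subspace.
    """
    return z[:n_shared], z[n_shared:]


def mean_squared_diff(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_disentangle_loss(z1, z2, n_shared, margin=1.0, private_weight=1.0):
    """Toy two-view objective (hypothetical, not taken from the paper).

    shared term:  small when the shared parts of both views agree
    private term: hinge penalty when the private parts are closer
                  than `margin`, so view-specific content is not collapsed
    """
    s1, p1 = split_latent(z1, n_shared)
    s2, p2 = split_latent(z2, n_shared)
    shared_term = mean_squared_diff(s1, s2)
    private_term = max(0.0, margin - mean_squared_diff(p1, p2))
    return shared_term + private_weight * private_term


# Two views that agree on the shared half and differ on the private half.
view_a = [1.0, 2.0, 3.0, 4.0]
view_b = [1.0, 2.0, 0.0, 0.0]
well_separated = toy_disentangle_loss(view_a, view_b, n_shared=2)

# Two views that disagree on the shared half and share the private half.
view_c = [3.0, 2.0, 3.0, 4.0]
entangled = toy_disentangle_loss(view_a, view_c, n_shared=2)
```

In this sketch the first pair incurs no penalty, while the second pair is penalized both for mismatched shared dimensions and for identical private dimensions. A real system would learn the encoders and would likely use a different objective; the point here is only the shared/private split.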

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis