Self-Supervised Multi-View Learning for Disentangled Music Audio Representations
Julia Wilkins, Sivan Ding, Magdalena Fuentes, Juan Pablo Bello

TL;DR
This paper introduces a self-supervised multi-view learning framework for music audio that effectively disentangles shared and private representations, improving the quality of learned audio features without labeled data.
Contribution
The paper presents a novel self-supervised learning (SSL) method that explicitly separates shared and private audio features, addressing the entangled representations produced by previous approaches.
Findings
Effective disentanglement of shared and private representations demonstrated
Improved robustness and generalization in music audio tasks shown
Case study confirms the method's effectiveness in controlled settings
Abstract
Self-supervised learning (SSL) offers a powerful way to learn robust, generalizable representations without labeled data. In music, where labeled data is scarce, existing SSL methods typically use generated supervision and multi-view redundancy to create pretext tasks. However, these approaches often produce entangled representations and lose view-specific information. We propose a novel self-supervised multi-view learning framework for audio designed to incentivize separation between private and shared representation spaces. A case study on audio disentanglement in a controlled setting demonstrates the effectiveness of our method.
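The idea of separating shared and private representation spaces can be illustrated with a toy objective. The sketch below is not the authors' method: the function names (`cosine`, `split`, `disentangle_loss`), the fixed split point `n_shared`, and the particular penalty terms are all assumptions chosen for illustration. It assumes each view's embedding is a flat vector whose first `n_shared` dimensions are treated as shared and whose remaining dimensions are treated as private.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero

def split(z, n_shared):
    """Split an embedding into (shared, private) parts (hypothetical layout)."""
    return z[:n_shared], z[n_shared:]

def disentangle_loss(z1, z2, n_shared, weight=1.0):
    """Toy two-view objective (an assumption, not the paper's loss).

    - align:    pulls the shared parts of the two views together
    - separate: penalizes similarity between the two private parts,
                so view-specific information stays out of the common space
    """
    s1, p1 = split(z1, n_shared)
    s2, p2 = split(z2, n_shared)
    align = 1.0 - cosine(s1, s2)
    separate = cosine(p1, p2) ** 2
    return align + weight * separate

# Two views whose shared halves agree and whose private halves differ
well_separated = disentangle_loss([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0], n_shared=2)

# Two views whose shared halves disagree and whose private halves coincide
entangled = disentangle_loss([1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0], n_shared=2)
```

In this toy setup the well-separated pair receives a loss near zero and the entangled pair a loss near two, which is the kind of gradient signal a disentangling objective is meant to provide. A real framework would learn the encoders and would likely use a different formulation.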
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
