Unsupervised Composable Representations for Audio

Giovanni Bindi; Philippe Esling

arXiv:2408.09792 · cs.LG · August 20, 2024

TL;DR

This paper introduces an unsupervised framework for compositional audio representations that leverages auto-encoding and diffusion models, enabling high-quality source separation and generation with lower computational costs.

Contribution

The paper proposes an unsupervised framework that combines an auto-encoding objective with diffusion models to learn compositional audio representations, and applies it to source separation and generation.

Findings

1. Achieves comparable or better source separation than existing methods.
2. Surpasses supervised baselines on signal-to-interference ratio metrics.
3. Operates efficiently in the latent space of neural audio codecs.

Abstract

Current generative models are able to generate high-quality artefacts but have been shown to struggle with compositional reasoning, which can be defined as the ability to generate complex structures from simpler elements. In this paper, we focus on the problem of compositional representation learning for music data, specifically targeting the fully-unsupervised setting. We propose a simple and extensible framework that leverages an explicit compositional inductive bias, defined by a flexible auto-encoding objective that can leverage any of the current state-of-art generative models. We demonstrate that our framework, used with diffusion models, naturally addresses the task of unsupervised audio source separation, showing that our model is able to perform high-quality separation. Our findings reveal that our proposal achieves comparable or superior performance with respect to other blind…
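The compositional auto-encoding idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the paper uses diffusion models as decoders and works in the latent space of a neural audio codec, whereas the sketch below swaps in plain linear maps so that it stays small and runnable. The function names (`encode`, `decode`, `compose`, `reconstruction_loss`), the use of summation as the composition operator, and all dimensions are assumptions made for illustration only.

```python
import numpy as np


def encode(x, encoders):
    """Map one mixture vector to a list of latents, one per component.

    Hypothetical stand-in for a learned encoder: each component has its
    own linear map here.
    """
    return [W @ x for W in encoders]


def decode(z, decoder):
    """Map a single latent back to the signal domain.

    Hypothetical stand-in for a generative decoder (the paper uses
    diffusion models; this sketch uses one shared linear map).
    """
    return decoder @ z


def compose(parts):
    """Combine decoded components; summation is an assumption."""
    return np.sum(parts, axis=0)


def reconstruction_loss(x, encoders, decoder):
    """Mean squared error between the mixture and the composed components."""
    latents = encode(x, encoders)
    parts = [decode(z, decoder) for z in latents]
    x_hat = compose(parts)
    return float(np.mean((x - x_hat) ** 2)), parts


# Toy setup with untrained random weights (illustrative sizes only).
rng = np.random.default_rng(0)
dim, latent_dim, n_components = 16, 4, 2
encoders = [0.1 * rng.standard_normal((latent_dim, dim)) for _ in range(n_components)]
decoder = 0.1 * rng.standard_normal((dim, latent_dim))

x = rng.standard_normal(dim)  # stands in for one mixture frame
loss, parts = reconstruction_loss(x, encoders, decoder)
```

Under these assumptions, the only training signal is the reconstruction of the mixture, so no isolated sources are needed. After training, each entry of `parts` would be read as one separated component, which is the sense in which such an objective can address unsupervised source separation.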

Code & Models

Repositories

ismir-24-sub/unsupervised_compositional_representations (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies

Methods: Diffusion · Focus