Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models

Fanny Roche (1, 2), Thomas Hueber (1), Samuel Limier (2), and Laurent Girin (1, 3). (1) Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Grenoble, France; (2) Arturia, Meylan, France; (3) INRIA, Perception Team, Montbonnot, France

arXiv:1806.04096 · eess.AS · May 27, 2019 · 22 cites

TL;DR

This paper compares several autoencoder architectures with PCA for music sound modeling. It finds that deep and recurrent autoencoders outperform PCA in reconstruction accuracy, while variational autoencoders offer a good balance between reconstruction quality and latent space usability.

Contribution

It provides a systematic comparison of linear, shallow, deep, recurrent, and variational autoencoders against PCA for music spectrum compression and synthesis.

Findings

1. Deep and recurrent autoencoders outperform PCA in reconstruction error.

2. PCA outperforms shallow autoencoders in this task.

3. Variational autoencoders balance reconstruction quality and latent space usability.

Abstract

This study investigates the use of non-linear unsupervised dimensionality reduction techniques to compress a music dataset into a low-dimensional representation, which can in turn be used for the synthesis of new sounds. We systematically compare (shallow) autoencoders (AEs), deep autoencoders (DAEs), recurrent autoencoders with Long Short-Term Memory cells (LSTM-AEs) and variational autoencoders (VAEs) with principal component analysis (PCA) for encoding the high-resolution short-term magnitude spectrum of a large and dense dataset of music notes into a lower-dimensional vector (which is then converted back to a magnitude spectrum used for sound resynthesis). Our experiments were conducted on the publicly available multi-instrument and multi-pitch database NSynth. Interestingly and contrary to the recent literature on image processing, we can show that PCA systematically outperforms…
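
To make the PCA baseline described in the abstract concrete, here is a minimal sketch of PCA used as a linear encoder/decoder pair, scored by reconstruction error at several latent sizes. It is an illustration under stated assumptions, not the authors' pipeline: the data are random synthetic frames rather than NSynth spectra, plain RMSE stands in for whatever error measure the paper uses, and the helper names (`pca_fit`, `pca_encode`, `pca_decode`, `rmse`) are hypothetical.

```python
import numpy as np


def pca_fit(spectra, latent_dim):
    """Fit PCA on rows of `spectra` (frames x frequency bins)."""
    mean = spectra.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(spectra - mean, full_matrices=False)
    return mean, vt[:latent_dim]


def pca_encode(spectra, mean, components):
    """Project frames onto the principal directions (the 'encoder')."""
    return (spectra - mean) @ components.T


def pca_decode(codes, mean, components):
    """Map latent codes back to the spectrum domain (the 'decoder')."""
    return codes @ components + mean


def rmse(a, b):
    """Root-mean-square error between two arrays of the same shape."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Synthetic stand-in data (an assumption, NOT NSynth): 200 non-negative
# frames of 64 bins, driven by 4 hidden factors plus a little noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 4))
basis = rng.normal(size=(4, 64))
spectra = np.abs(factors @ basis) + 0.01 * rng.normal(size=(200, 64))

# Reconstruction error as a function of the latent dimension.
errors = {}
for latent_dim in (2, 4, 16, 64):
    mean, components = pca_fit(spectra, latent_dim)
    codes = pca_encode(spectra, mean, components)
    errors[latent_dim] = rmse(spectra, pca_decode(codes, mean, components))

for latent_dim, err in errors.items():
    print(f"latent_dim={latent_dim:3d}  rmse={err:.6f}")
```

The error can only shrink as the latent dimension grows, because each larger PCA subspace contains the smaller ones. An autoencoder would replace the two projections with learned, possibly non-linear mappings and could be evaluated with the same encode, decode and error loop.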

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies