An investigation of the reconstruction capacity of stacked convolutional autoencoders for log-mel-spectrograms
Anastasia Natsiou, Luca Longo, Sean O'Leary

TL;DR
This paper explores the use of stacked convolutional autoencoders for compressing and reconstructing monophonic harmonic sounds from log-mel-spectrograms, demonstrating effective unsupervised audio representation learning.
Contribution
It introduces a novel application of convolutional autoencoders for audio compression and proposes an evaluation metric based on frequency accuracy for harmonic sound reconstruction.
Findings
Autoencoders successfully reconstruct harmonic sounds from compressed representations
Hyper-parameter tuning improves reconstruction quality
Frequency accuracy correlates with perceived sound quality
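This summary does not define the frequency-accuracy metric, so the following is only an illustrative sketch of one way such a score could be computed. It assumes the score is the fraction of the strongest spectral peaks in a reference frame that reappear in the reconstructed frame within a small bin tolerance. The function names, the peak-picking rule, and the parameters `k` and `tolerance` are all hypothetical and are not taken from the paper.

```python
def peak_bins(magnitudes, k):
    """Return the bin indices of the k strongest local maxima, in ascending order."""
    peaks = [
        i for i in range(1, len(magnitudes) - 1)
        if magnitudes[i] > magnitudes[i - 1] and magnitudes[i] >= magnitudes[i + 1]
    ]
    strongest = sorted(peaks, key=lambda i: magnitudes[i], reverse=True)[:k]
    return sorted(strongest)


def frequency_accuracy(reference, reconstruction, k=3, tolerance=1):
    """Fraction of the reference's k strongest peaks that also appear in the
    reconstruction within `tolerance` bins (an assumed definition)."""
    ref_peaks = peak_bins(reference, k)
    rec_peaks = peak_bins(reconstruction, k)
    if not ref_peaks:
        return 0.0
    matched = sum(
        1 for r in ref_peaks
        if any(abs(r - p) <= tolerance for p in rec_peaks)
    )
    return matched / len(ref_peaks)


# Toy magnitude frame with harmonic peaks at bins 2, 4 and 6.
reference = [0.0, 0.1, 1.0, 0.1, 0.6, 0.1, 0.3, 0.0]
# Toy reconstruction that keeps the first two peaks but flattens the third.
reconstruction = [0.0, 0.1, 0.9, 0.1, 0.5, 0.1, 0.1, 0.1]

score = frequency_accuracy(reference, reconstruction)
print(f"frequency accuracy: {score:.3f}")
```

A peak-based score like this rewards reconstructions that preserve the harmonic positions even when their amplitudes drift, which is one plausible reason to prefer it over a plain pixel-wise error for harmonic sounds.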
Abstract
In audio processing applications, there is high demand for generating expressive sounds from high-level representations. These representations can be used to manipulate timbre and to influence the synthesis of creative instrumental notes. Modern algorithms, such as neural networks, have inspired the development of expressive synthesizers based on musical instrument timbre compression. Unsupervised deep learning methods can achieve audio compression by training a network to learn a mapping from waveforms or spectrograms to low-dimensional representations. This study investigates the use of stacked convolutional autoencoders for compressing time-frequency audio representations of a variety of instruments at a single pitch. Further exploration of hyper-parameters and regularization techniques is shown to improve on the performance of the initial design. In an…
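To make the idea of compressing a spectrogram with stacked convolutions concrete, the sketch below traces how the shape of a log-mel-spectrogram shrinks through a stack of strided convolution layers, using the standard output-size formula floor((n + 2p - k) / s) + 1. The input size (128 mel bands by 64 frames) and the layer settings are hypothetical values chosen for illustration; the paper's actual architecture is not given in this summary.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard convolution output length along one axis (dilation of 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(n_mels, n_frames, layers):
    """Trace (channels, height, width) through a stack of square-kernel convs.

    `layers` is a list of (out_channels, kernel, stride, padding) tuples.
    """
    shapes = [(1, n_mels, n_frames)]  # single-channel spectrogram input
    height, width = n_mels, n_frames
    for channels, kernel, stride, padding in layers:
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
        shapes.append((channels, height, width))
    return shapes


# Hypothetical three-layer encoder: each layer halves both axes.
layers = [(8, 3, 2, 1), (16, 3, 2, 1), (8, 3, 2, 1)]
shapes = encoder_shapes(n_mels=128, n_frames=64, layers=layers)

input_values = shapes[0][0] * shapes[0][1] * shapes[0][2]
latent_values = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
compression_ratio = input_values / latent_values

for shape in shapes:
    print(shape)
print(f"compression ratio: {compression_ratio:.1f}x")
```

With these assumed settings the encoder maps 8192 input values to a 1024-value latent tensor. A decoder would mirror the stack with transposed convolutions to restore the original spectrogram size.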
