MelNet: A Generative Model for Audio in the Frequency Domain

Sean Vasquez; Mike Lewis

arXiv:1906.01083 · eess.AS · June 5, 2019 · 111 cites

TL;DR

MelNet is a generative model that operates on spectrograms in the frequency domain to produce high-fidelity audio, capturing structure at longer timescales than time-domain models have achieved.

Contribution

The paper combines 2D time-frequency representations (spectrograms), a highly expressive probabilistic model, and a multiscale generation procedure to improve audio synthesis.

Findings

01 Improves on previous approaches in density estimates

02 Achieves better human judgments than previous approaches

03 Applies to unconditional speech generation, music generation, and text-to-speech synthesis

Abstract

Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps. While long-range dependencies are difficult to model directly in the time domain, we show that they can be more tractably modelled in two-dimensional time-frequency representations such as spectrograms. By leveraging this representational advantage, in conjunction with a highly expressive probabilistic model and a multiscale generation procedure, we design a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve. We apply our model to a variety of audio generation tasks, including unconditional speech generation, music generation, and text-to-speech synthesis, showing improvements over previous approaches in both density estimates and human judgments.
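
The abstract's point about sequence length can be illustrated with a minimal sketch that converts one second of audio into a two-dimensional time-frequency array. This is not the paper's pipeline: the sample rate, frame length, hop length, and the function name `magnitude_spectrogram` are assumptions chosen for illustration, and the sketch computes a plain magnitude spectrogram without mel scaling.

```python
import numpy as np

def magnitude_spectrogram(waveform, frame_length=1024, hop_length=256):
    """Return a (frames, frequency_bins) magnitude spectrogram.

    Hypothetical helper for illustration; frame_length and hop_length
    are assumed values, not settings taken from the paper.
    """
    window = np.hanning(frame_length)
    # Number of full frames that fit in the signal (no padding).
    n_frames = 1 + (len(waveform) - frame_length) // hop_length
    frames = np.stack([
        waveform[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # Real FFT of each windowed frame; keep magnitudes only.
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 22050  # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz tone

spec = magnitude_spectrogram(waveform)
print("time-domain steps:", len(waveform))
print("spectrogram shape (frames, bins):", spec.shape)
```

Under these assumed settings, the time axis shrinks from 22,050 samples to 83 frames, with the remaining detail moved onto a frequency axis of 513 bins. A model that works on this array sees one second of audio as far fewer time steps.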

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing