MelNet: A Generative Model for Audio in the Frequency Domain
Sean Vasquez, Mike Lewis

TL;DR
MelNet is a generative model that operates on two-dimensional time-frequency representations (spectrograms) rather than raw waveforms, producing high-fidelity audio that captures structure at longer timescales than time-domain models have achieved.
Contribution
The paper presents a new probabilistic model leveraging 2D time-frequency representations and multiscale generation for improved audio synthesis.
Findings
Improves on previous approaches in density estimates
Improves on previous approaches in human judgments
Applied to unconditional speech generation, music generation, and text-to-speech synthesis
Abstract
Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps. While long-range dependencies are difficult to model directly in the time domain, we show that they can be more tractably modelled in two-dimensional time-frequency representations such as spectrograms. By leveraging this representational advantage, in conjunction with a highly expressive probabilistic model and a multiscale generation procedure, we design a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve. We apply our model to a variety of audio generation tasks, including unconditional speech generation, music generation, and text-to-speech synthesis, showing improvements over previous approaches in both density estimates and human judgments.
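To make the abstract's point about sequence length concrete, the following is a minimal sketch that converts one second of audio into a magnitude spectrogram and compares the number of steps along the time axis. It is an illustration only, not the paper's pipeline: the sample rate, frame length, hop length, and the helper name `magnitude_spectrogram` are all assumptions chosen for the example, and it uses a plain short-time Fourier transform in NumPy.

```python
import numpy as np


def magnitude_spectrogram(waveform, frame_length=1024, hop_length=256):
    """Return a (frames, frequency_bins) magnitude spectrogram.

    Hypothetical helper for illustration; frame_length and hop_length
    are assumed values, not settings taken from the paper.
    """
    window = np.hanning(frame_length)
    num_frames = 1 + (len(waveform) - frame_length) // hop_length
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack([
        waveform[i * hop_length: i * hop_length + frame_length] * window
        for i in range(num_frames)
    ])
    # A real-input FFT per frame gives frame_length // 2 + 1 frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1))


sample_rate = 22050  # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz tone

spec = magnitude_spectrogram(waveform)
print(waveform.shape, spec.shape)  # (22050,) (83, 513)
```

Under these assumed settings, one second of audio is 22,050 steps in the time domain but only 83 steps along the spectrogram's time axis, with the remaining information spread across 513 frequency bins per step. This is the sense in which long-range structure spans far fewer steps in a time-frequency representation.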
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
