Fast Spectrogram Inversion using Multi-head Convolutional Neural Networks
Sercan O. Arik, Heewoo Jun, and Gregory Diamos

TL;DR
This paper introduces MCNN, a multi-head convolutional neural network architecture that enables extremely fast and high-quality speech waveform synthesis directly from spectrograms, outperforming traditional iterative methods in speed and efficiency.
Contribution
The paper presents a novel multi-head CNN architecture for spectrogram inversion that synthesizes waveforms at more than 300x real-time, without iterative algorithms or autoregression.
Findings
MCNN achieves over 300x real-time waveform synthesis.
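It reaches more than an order of magnitude higher compute intensity than iterative algorithms like Griffin-Lim, which lets it use modern multi-core processors efficiently.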
The approach produces high-quality speech without autoregression.
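To make the "300x real-time" figure concrete, the arithmetic below shows the compute-time budget it implies. This is only a unit conversion; the function name `synthesis_time_budget` is a hypothetical helper, not something from the paper.

```python
def synthesis_time_budget(audio_seconds, realtime_factor):
    """Wall-clock time available to synthesize `audio_seconds` of audio
    when running at `realtime_factor` times real-time."""
    return audio_seconds / realtime_factor


# At 300x real-time, one second of audio must be produced in 1/300 s.
budget = synthesis_time_budget(1.0, 300.0)
print(f"{budget * 1000:.2f} ms per second of audio")  # prints "3.33 ms per second of audio"
```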
Abstract
We propose the multi-head convolutional neural network (MCNN) architecture for waveform synthesis from spectrograms. Nonlinear interpolation in MCNN is employed with transposed convolution layers in parallel heads. MCNN achieves more than an order of magnitude higher compute intensity than commonly-used iterative algorithms like Griffin-Lim, yielding efficient utilization for modern multi-core processors, and very fast (more than 300x real-time) waveform synthesis. For training of MCNN, we use a large-scale speech recognition dataset and losses defined on waveforms that are related to perceptual audio quality. We demonstrate that MCNN constitutes a very promising approach for high-quality speech synthesis, without any iterative algorithms or autoregression in computations.
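The idea of parallel heads built from transposed convolutions can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes a single channel, no padding, a `tanh` nonlinearity between layers, and a plain sum to combine heads, and the names `transposed_conv1d`, `head_forward`, and `mcnn_forward` are hypothetical. The kernel values are arbitrary, not trained weights.

```python
import math


def transposed_conv1d(x, kernel, stride):
    """Single-channel 1-D transposed convolution without padding.

    Each input sample is scattered into the output, scaled by the kernel,
    so the output length is (len(x) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, xi in enumerate(x):
        for j, wj in enumerate(kernel):
            out[i * stride + j] += xi * wj
    return out


def head_forward(x, layers):
    """One head: a stack of transposed convolutions.

    A nonlinearity (tanh, an assumption of this sketch) is applied between
    layers, which makes the upsampling a nonlinear interpolation.
    """
    for idx, (kernel, stride) in enumerate(layers):
        x = transposed_conv1d(x, kernel, stride)
        if idx < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


def mcnn_forward(x, heads):
    """Combine heads by summing their outputs (an assumption of this sketch).

    Heads do not depend on each other, so they could be evaluated in parallel.
    """
    outputs = [head_forward(x, layers) for layers in heads]
    return [sum(vals) for vals in zip(*outputs)]


# Three input frames, two heads, each with two stride-2 layers.
frames = [0.5, -1.0, 0.25]
heads = [
    [([1.0, 0.5], 2), ([0.5, 0.5], 2)],
    [([0.3, -0.3], 2), ([1.0, 0.0], 2)],
]
waveform = mcnn_forward(frames, heads)
print(len(frames), "->", len(waveform))  # prints "3 -> 12"
```

Because every output sample is computed in a single forward pass, with no dependence on previously generated samples, this kind of structure avoids both the iteration of Griffin-Lim and the sequential sampling of autoregressive models.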
Taxonomy
Methods: Transposed convolution · Convolution
