SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis
Bohan Zhai, Tianren Gao, Flora Xue, Daniel Rothchild, Bichen Wu, Joseph E. Gonzalez, Kurt Keutzer

TL;DR
SqueezeWave is a family of lightweight vocoders based on WaveGlow. It generates speech of similar quality to WaveGlow at a much lower computational cost, which makes it suitable for real-time synthesis on edge devices.
Contribution
The paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that generate audio of similar quality while requiring 61x to 214x fewer MACs.
Findings
SqueezeWave reduces MACs by 61x to 214x compared to WaveGlow.
SqueezeWave achieves similar audio quality to WaveGlow.
The model is suitable for real-time on-device speech synthesis.
Abstract
Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that can generate audio of similar quality to WaveGlow with 61x - 214x fewer MACs. Code, trained models, and generated audio are publicly available at https://github.com/tianrengao/SqueezeWave.
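For readers unfamiliar with the metric, a MAC is one multiply-accumulate operation, and the "61x - 214x fewer MACs" figure is a ratio of such counts between two models. The following is a minimal sketch of how that kind of ratio is computed, assuming a single dense 1D convolution. The helper names `conv1d_macs` and `reduction_factor` and all layer shapes are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
def conv1d_macs(in_channels, out_channels, kernel_size, output_length):
    """MACs for a dense 1D convolution (bias ignored).

    One multiply-accumulate is needed per combination of
    output position, output channel, input channel, and kernel tap.
    """
    return output_length * out_channels * in_channels * kernel_size


def reduction_factor(baseline_macs, reduced_macs):
    """How many times fewer MACs the reduced layer needs."""
    return baseline_macs / reduced_macs


# Hypothetical shapes for illustration only (not SqueezeWave's actual layers).
baseline = conv1d_macs(in_channels=256, out_channels=256,
                       kernel_size=3, output_length=2000)

# Same layer with half the channels and an 8x shorter temporal length.
reduced = conv1d_macs(in_channels=128, out_channels=128,
                      kernel_size=3, output_length=250)

print(baseline)                             # 393216000
print(reduced)                              # 12288000
print(reduction_factor(baseline, reduced))  # 32.0
```

Because the count is a product of the layer dimensions, shrinking several dimensions at once compounds: halving both channel widths gives 4x, and the 8x shorter length brings the total to 32x in this toy case.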
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Dialogue Systems
