Parallel Synthesis for Autoregressive Speech Generation
Po-chun Hsu, Da-rong Liu, Andy T. Liu, and Hung-yi Lee

TL;DR
This paper introduces a parallel approach to autoregressive speech synthesis. Instead of predicting samples one time step at a time, the model generates speech autoregressively across frequency subbands and bits, which speeds up inference. The paper reports better quality than baseline vocoders.
Contribution
The paper proposes frequency-wise autoregressive generation (FAR) and bit-wise autoregressive generation (BAR). With these techniques, the number of iterative generation steps depends on the number of subbands and bits rather than on the utterance length, which improves inference efficiency.
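The bit-wise idea can be illustrated with a toy sketch. This is not the authors' model: there is no neural network here, the "prediction" at each step is simply the true bit plane, and the 8-bit depth and the helper names `to_bit_planes` and `from_bit_planes` are assumptions made for illustration. The sketch only shows that when a quantized signal is built up one bit at a time (most significant bit first), the number of sequential steps equals the number of bits, while each step covers every sample at once.

```python
def to_bit_planes(samples, num_bits=8):
    """Split non-negative integer samples into bit planes, MSB first.

    Each plane holds one bit for every sample in the sequence.
    """
    return [
        [(s >> b) & 1 for s in samples]
        for b in range(num_bits - 1, -1, -1)
    ]


def from_bit_planes(planes):
    """Rebuild samples by consuming one bit plane per sequential step.

    In a real bit-wise autoregressive model, each plane would be predicted
    conditioned on the planes already generated; here the true planes are
    used as a stand-in (assumption for illustration).
    """
    samples = [0] * len(planes[0])
    steps = 0
    for plane in planes:
        # One step updates every sample in the sequence at once.
        samples = [(s << 1) | bit for s, bit in zip(samples, plane)]
        steps += 1
    return samples, steps


signal = [0, 37, 128, 255, 200, 5]   # toy 8-bit quantized samples
planes = to_bit_planes(signal)
rebuilt, steps = from_bit_planes(planes)
```

The loop in `from_bit_planes` runs once per bit plane, so a longer `signal` does not add sequential steps; it only makes each step wider.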
Findings
Achieves faster-than-real-time speech synthesis without GPU acceleration.
Outperforms baseline vocoders in MUSHRA quality scores.
Shows strong generalization to unseen speakers and high sampling rates.
Abstract
Autoregressive neural vocoders have achieved outstanding performance in speech synthesis tasks such as text-to-speech and voice conversion. An autoregressive vocoder predicts a sample at some time step conditioned on those at previous time steps. Though it synthesizes natural human speech, the iterative generation inevitably makes the synthesis time proportional to the utterance length, leading to low efficiency. Many works have sought to generate the whole speech sequence in parallel, proposing GAN-based, flow-based, and score-based vocoders. This paper proposes a new approach to autoregressive generation. Instead of iteratively predicting samples in a time sequence, the proposed model performs frequency-wise autoregressive generation (FAR) and bit-wise autoregressive generation (BAR) to synthesize speech. In FAR, a speech utterance is split into frequency subbands, and a…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
