Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments

Reo Yoneyama; Masaya Kawamura; Ryo Terashima; Ryuichi Yamamoto; Tomoki Toda

arXiv:2506.03554·cs.SD·June 5, 2025

Open Access

TL;DR

This paper introduces MS-Wavehax, a neural vocoder optimized for low-latency streaming speech synthesis in resource-limited settings, balancing latency, quality, and computational efficiency.

Contribution

The paper extends Wavehax with multi-stream decomposition, analyzes the latency-throughput trade-off in a CPU-only environment, and provides practical guidelines for deployment in resource-constrained environments.

Findings

01

MS-Wavehax achieves high speech quality in streaming conditions.

02

The CPU-only latency-throughput analysis identifies key bottlenecks in streaming neural vocoders and offers guidance on choosing chunk sizes.

03

The vocoder is compact and suitable for resource-limited devices.

Abstract

In real-time speech synthesis, neural vocoders often require low-latency synthesis through causal processing and streaming. However, streaming introduces inefficiencies absent in batch synthesis, such as limited parallelism, inter-frame dependency management, and parameter loading overhead. This paper proposes multi-stream Wavehax (MS-Wavehax), an efficient neural vocoder for low-latency streaming, by extending the aliasing-free neural vocoder Wavehax with multi-stream decomposition. We analyze the latency-throughput trade-off in a CPU-only environment and identify key bottlenecks in streaming neural vocoders. Our findings provide practical insights for optimizing chunk sizes and designing vocoders tailored to specific application demands and hardware constraints. Furthermore, our subjective evaluations show that MS-Wavehax delivers high speech quality under causal and non-causal…
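The latency-throughput trade-off mentioned in the abstract can be illustrated with a toy cost model. The sketch below is not taken from the paper: the function name `chunk_cost`, the 10 ms frame shift, the fixed per-call overhead, and the per-frame compute cost are all hypothetical values chosen only to show the shape of the trade-off. It assumes each streaming call pays a fixed overhead (for example, parameter loading) plus a cost proportional to the number of frames in the chunk.

```python
from dataclasses import dataclass


@dataclass
class ChunkCost:
    chunk_frames: int
    buffering_ms: float  # time spent waiting for the chunk's input frames
    compute_ms: float    # time spent synthesizing the chunk
    latency_ms: float    # buffering + compute
    rtf: float           # compute time / audio duration; below 1 is real-time


def chunk_cost(chunk_frames, frame_shift_ms=10.0, overhead_ms=2.0, per_frame_ms=0.5):
    """Toy model: all default numbers are illustrative assumptions, not measurements."""
    audio_ms = chunk_frames * frame_shift_ms
    # Fixed per-call overhead is amortized over more frames as the chunk grows.
    compute_ms = overhead_ms + per_frame_ms * chunk_frames
    return ChunkCost(
        chunk_frames=chunk_frames,
        buffering_ms=audio_ms,
        compute_ms=compute_ms,
        latency_ms=audio_ms + compute_ms,
        rtf=compute_ms / audio_ms,
    )


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        c = chunk_cost(n)
        print(f"chunk={c.chunk_frames:2d} frames  latency={c.latency_ms:6.1f} ms  RTF={c.rtf:.4f}")
```

Under these assumed costs, larger chunks lower the real-time factor because the fixed overhead is spread over more audio, while latency grows because more input must be buffered before synthesis can start. Real measurements on the target hardware would replace the placeholder constants.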

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Parallel Computing and Optimization Techniques