Toward Complex-Valued Neural Networks for Waveform Generation
Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

TL;DR
This paper introduces ComVo, a complex-valued neural vocoder that uses native complex arithmetic and phase quantization to improve waveform synthesis quality, together with a block-matrix computation scheme that improves training efficiency.
Contribution
The paper presents a complex-valued neural vocoder architecture whose generator and discriminator both use native complex arithmetic, combined with phase quantization and a block-matrix computation scheme that shortens training.
Findings
ComVo outperforms real-valued baselines in synthesis quality.
The block-matrix scheme reduces training time by 25%.
Native complex arithmetic improves modeling of spectrogram structures.
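The block-matrix idea in the findings above can be illustrated with the standard real-valued representation of a complex product: for W = A + iB and z = x + iy, Wz = (Ax - By) + i(Bx + Ay), which is one real matrix-vector product with the block matrix [[A, -B], [B, A]]. The sketch below is only that textbook identity in plain Python; the function names are hypothetical and this summary does not describe how ComVo's scheme is actually implemented.

```python
import random


def complex_matvec(W, z):
    """Multiply a complex matrix W (list of rows) by a complex vector z."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]


def block_matvec(W, z):
    """Compute the same product with one real multiply over [[A, -B], [B, A]]."""
    m = len(W)
    # Top block rows are [A | -B]; bottom block rows are [B | A].
    top = [[w.real for w in row] + [-w.imag for w in row] for row in W]
    bottom = [[w.imag for w in row] + [w.real for w in row] for row in W]
    block = top + bottom
    # Stack real parts over imaginary parts: [x; y].
    stacked = [x.real for x in z] + [x.imag for x in z]
    out = [sum(b * s for b, s in zip(row, stacked)) for row in block]
    # First m outputs are the real part, last m the imaginary part.
    return [complex(out[i], out[m + i]) for i in range(m)]


random.seed(0)
W = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(4)]
     for _ in range(3)]
z = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(4)]

direct = complex_matvec(W, z)
blocked = block_matvec(W, z)
max_err = max(abs(a - b) for a, b in zip(direct, blocked))
print(max_err)  # difference between the two routes (floating-point noise)
```

Folding the four real products into one larger real multiply is a plausible way to save time on hardware tuned for real matrix multiplication, though whether that accounts for the reported 25% reduction is not stated here.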
Abstract
Neural vocoders have recently advanced waveform generation, yielding natural and expressive audio. Among these approaches, iSTFT-based vocoders have recently gained attention. They predict a complex-valued spectrogram and then synthesize the waveform via iSTFT, thereby avoiding learned upsampling stages that can increase computational cost. However, current approaches use real-valued networks that process the real and imaginary parts independently. This separation limits their ability to capture the inherent structure of complex spectrograms. We present ComVo, a Complex-valued neural Vocoder whose generator and discriminator use native complex arithmetic. This enables an adversarial training framework that provides structured feedback in complex-valued representations. To guide phase transformations in a structured manner, we introduce phase quantization, which discretizes phase values…
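As a rough illustration of what discretizing phase values can mean, the sketch below snaps the phase of a complex number to the nearest of a fixed number of evenly spaced angles while leaving its magnitude unchanged. The function name `quantize_phase`, the default of 8 levels, and nearest-angle rounding are all assumptions made for this example; the truncated abstract does not say how ComVo chooses the levels, where in the network the operation is applied, or how gradients pass through it.

```python
import cmath
import math


def quantize_phase(z, levels=8):
    """Snap the phase of z to the nearest multiple of 2*pi/levels, keeping |z|."""
    step = 2 * math.pi / levels          # spacing between allowed phase angles
    magnitude, phase = cmath.polar(z)    # phase lies in [-pi, pi]
    snapped = round(phase / step) * step
    return cmath.rect(magnitude, snapped)


z = cmath.rect(2.0, 0.8)                 # magnitude 2.0, phase 0.8 rad
q = quantize_phase(z, levels=8)          # nearest allowed angle is pi/4
print(abs(q), cmath.phase(q))
```

In this sketch only the angle changes, so magnitude information passes through untouched and the quantization error is confined to the phase.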
Peer Reviews
Decision·ICLR 2026 Poster
The exploration of complex-valued networks for generative audio tasks is a compelling research direction, and the authors present a well-executed implementation. The paper is technically solid; the proposed phase quantization is an interesting inductive bias for stabilizing phase prediction, and the block-matrix formulation for accelerating training is a valuable engineering contribution that demonstrably reduces training time by 25%. The experimental evaluation is thorough, with comparisons aga
Despite the positive results, I have fundamental reservations about the paper's central motivation. The primary claim is that CVNNs are superior because they "capture the intrinsic dependencies between the real and imaginary components." However, this central hypothesis is asserted rather than rigorously validated. The performance gains, while present, do not in themselves prove that this specific mechanism is the cause. My main conceptual issue is that for a spectrogram to be perfectly invertib
Originality: The introduction of complex-valued neural networks (CVNNs) for waveform generation is an interesting and novel approach that is not widely explored in the context of vocoders. The proposed method shows potential in capturing the structure of complex spectrograms by treating them as unified complex entities.
Quality: The paper is well-written, and the experimental setup is clearly described. The proposed method shows promising results in terms of both objective and subjective evalua
A major weakness of this paper is that it compares the proposed method only to real-valued vocoders and iSTFT-based methods. The paper does not include a comparison with vocoders that predict both amplitude and phase spectrograms (such as APNet and FreeV). These methods already integrate both real and imaginary parts in their amplitude and phase spectrogram predictions, which might address the issue the authors claim with real-valued networks. Without this comparison, it is difficult to conclusi
- The use of CVNNs demonstrates clear performance gains over real-valued architectures.
- The paper presents various ablation studies, providing insight into the role of each proposed component in the overall system. The architectural design is well-motivated and supported by empirical evidence.
- ComVo scales effectively, showing competitive performance across both lightweight and large model configurations.
- The advantage of ComVo is maintained even when integrated into a TTS pipeline, highli
- For me, the proposed block-matrix computation seems more like an implementation-level optimization rather than a novel research contribution.
- The ablation results for phase quantization appear relatively weak, offering limited evidence to justify its effectiveness.
- Although Table 8 indicates that ComVo and the baselines have similar parameter counts, ComVo uses a different parameter datatype (e.g., ComVo’s parameters are stored in complex64 format, which effectively corresponds to two floa
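The datatype point raised in the last review item rests on a simple storage fact: a complex64 value holds two 32-bit floats, one for the real part and one for the imaginary part. The sketch below packs a complex number in that two-float32 layout with Python's `struct` module and counts the real scalars behind a given number of complex parameters. The helper names and the figure of 1,000,000 parameters are illustrative only and are not taken from the paper's Table 8.

```python
import struct


def pack_complex64(z):
    """Pack z as two little-endian float32 values: real part, then imaginary part."""
    return struct.pack("<2f", z.real, z.imag)


def real_scalar_count(num_complex_params):
    """Real scalars stored for a given number of complex parameters."""
    return 2 * num_complex_params


packed = pack_complex64(complex(0.5, -1.25))
float32_size = struct.calcsize("<f")      # bytes in one float32
print(len(packed), float32_size)          # one complex64 value vs one float32

# Hypothetical model size, used only to show the factor of two.
print(real_scalar_count(1_000_000))
```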
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
