Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension
Zhen-Hua Ling, Yang Ai, Yu Gu, Li-Rong Dai

TL;DR
This paper introduces a hierarchical recurrent neural network (HRNN) that performs speech bandwidth extension directly on waveform samples. It outperforms other neural waveform models in speech quality and run-time efficiency, and a conventional vocoder-based method in subjective quality.
Contribution
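The paper proposes an HRNN model for speech bandwidth extension (BWE) that predicts wideband or high-frequency waveform samples directly from narrowband input, without a vocoder. The model combines hierarchically arranged LSTM layers with feed-forward layers and accepts additional conditioning inputs to improve quality.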
Findings
HRNN achieves better speech quality and run-time efficiency than the dilated CNN (DCNN) and plain sample-level RNN (SRNN) baselines.
The method surpasses traditional vocoder-based BWE in subjective quality.
The hierarchical structure, in which each LSTM layer operates at its own temporal resolution, captures long-span dependencies efficiently.
Abstract
This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Different from conventional BWE methods which predict spectral parameters for reconstructing wideband speech waveforms, this BWE method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN which is an unconditional neural audio generator, the HRNN model represents the distribution of each wideband or high-frequency waveform sample conditioned on the input narrowband waveform samples using a neural network composed of long short-term memory (LSTM) layers and feed-forward (FF) layers. The LSTM layers form a hierarchical structure and each layer operates at a specific temporal resolution to efficiently capture long-span dependencies between temporal sequences. Furthermore, additional conditions, such…
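The idea of tiers running at different temporal resolutions can be illustrated with a small sketch. This is not the authors' implementation: the frame sizes (8, 2, 1), the scalar tanh recurrence standing in for an LSTM layer, and all function names are assumptions chosen only to show how a slower tier conditions a faster one.

```python
import math


def split_frames(samples, frame_size):
    """Split a sequence into non-overlapping frames of frame_size samples."""
    if len(samples) % frame_size != 0:
        raise ValueError("length must be a multiple of frame_size")
    return [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]


def run_tier(frames, upper_states, ratio, decay=0.5):
    """Toy recurrent tier: one state update per frame.

    A scalar tanh recurrence stands in for an LSTM layer (assumption).
    upper_states[k // ratio] is the conditioning from the slower tier above.
    """
    state = 0.0
    states = []
    for k, frame in enumerate(frames):
        cond = upper_states[k // ratio] if upper_states is not None else 0.0
        frame_mean = sum(frame) / len(frame)
        state = math.tanh(decay * state + frame_mean + cond)
        states.append(state)
    return states


def hierarchical_states(narrowband, frame_sizes=(8, 2, 1)):
    """Run tiers from slowest to fastest; return one state sequence per tier.

    frame_sizes are hypothetical; each must be a multiple of the next one.
    """
    all_states = []
    upper, upper_size = None, None
    for size in frame_sizes:
        frames = split_frames(narrowband, size)
        ratio = upper_size // size if upper_size is not None else 1
        upper = run_tier(frames, upper, ratio)
        upper_size = size
        all_states.append(upper)
    return all_states


# 16 narrowband samples: the slowest tier updates 2 times, the fastest 16 times.
narrowband = [math.sin(0.3 * n) for n in range(16)]
tiers = hierarchical_states(narrowband)
print([len(s) for s in tiers])  # [2, 8, 16]
```

The slowest tier updates only once per 8 samples, so its recurrence spans a long stretch of signal in few steps, while the fastest tier still produces one state per sample. In a real model the per-sample state would feed layers that output a distribution over the next waveform sample; here it is left as a plain number.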
