A Dual-Staged Context Aggregation Method Towards Efficient End-To-End Speech Enhancement
Kai Zhen, Mi Suk Lee, Minje Kim

TL;DR
This paper introduces DCCRN, a densely connected convolutional and recurrent network that aggregates temporal context in two stages for end-to-end, time-domain speech enhancement. It outperforms competing convolutional baselines in STOI and PESQ with only 1.38 million parameters.
Contribution
The paper proposes DCCRN, a densely connected hybrid of convolutional and recurrent components linked by a cross-component identical shortcut, for dual-staged temporal context aggregation in end-to-end speech enhancement.
Findings
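DCCRN consistently outperforms competing convolutional baselines in STOI and PESQ at three SNR levels; the abstract reports average improvements of 0.23 (STOI) and 1.38 (PESQ).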
The model is computationally efficient with only 1.38 million parameters.
It maintains decent generalizability to unseen noise types.
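The STOI and PESQ scores are reported at several SNR levels. As a rough illustration of what an SNR level means for a test utterance, the sketch below scales a noise signal so that the clean-to-noise power ratio matches a target value in dB. The function names, the synthetic sine "speech", the Gaussian noise, and the three levels used in the loop are all assumptions for illustration, not the paper's data or evaluation code.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then add it."""
    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [c + gain * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB recovered from a clean signal and its noisy mixture."""
    residual = [y - c for c, y in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(residual))


rng = random.Random(0)
# Stand-ins for real audio: a 440 Hz tone at 16 kHz and white Gaussian noise.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

for snr_db in (-5, 0, 5):  # hypothetical levels, chosen only for the demo
    noisy = mix_at_snr(clean, noise, snr_db)
    print(snr_db, round(measured_snr_db(clean, noisy), 2))
```

Lower SNR values mean the noise carries more power relative to the speech, so an enhancement model has less clean signal to work from.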
Abstract
In speech enhancement, an end-to-end deep neural network converts a noisy speech signal to clean speech directly in the time domain, without time-frequency transformation or mask estimation. However, aggregating contextual information from a high-resolution time-domain signal at an affordable model complexity remains challenging. In this paper, we propose a densely connected convolutional and recurrent network (DCCRN), a hybrid architecture, to enable dual-staged temporal context aggregation. With the dense connectivity and cross-component identical shortcut, DCCRN consistently outperforms competing convolutional baselines with an average STOI improvement of 0.23 and PESQ of 1.38 at three SNR levels. The proposed method is computationally efficient with only 1.38 million parameters. The generalizability performance on the unseen noise types is still decent considering its low…
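The two ideas named in the abstract, dense connectivity and a convolutional stage followed by a recurrent stage, can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the authors' DCCRN: the layer sizes, the dilation pattern, the ReLU, the plain tanh recurrence, and every function name are hypothetical, and the cross-component shortcut is not modeled.

```python
import math
import random


def conv1d_same(x, weights, dilation):
    """Dilated 1-D convolution with zero padding and ReLU; keeps the length T.

    x: list of C_in channels, each a list of T samples.
    weights[o][c][k]: tap k for output channel o and input channel c.
    """
    T = len(x[0])
    half = (len(weights[0][0]) // 2) * dilation
    out = []
    for w_o in weights:
        y = []
        for t in range(T):
            acc = 0.0
            for c, w_c in enumerate(w_o):
                for k, w in enumerate(w_c):
                    idx = t - half + k * dilation
                    if 0 <= idx < T:
                        acc += w * x[c][idx]
            y.append(max(acc, 0.0))
        out.append(y)
    return out


def dense_block(x, layer_weights, dilations):
    """Stage 1: each layer sees the concatenation of all earlier feature maps."""
    features = list(x)
    for weights, d in zip(layer_weights, dilations):
        features = features + conv1d_same(features, weights, d)
    return features


def recurrent_stage(features, w_in, w_rec):
    """Stage 2: a plain tanh recurrence over time (a stand-in recurrent unit)."""
    hidden = len(w_rec)
    h = [0.0] * hidden
    states = []
    for t in range(len(features[0])):
        frame = [ch[t] for ch in features]
        h = [math.tanh(sum(w * v for w, v in zip(w_in[j], frame))
                       + sum(w * v for w, v in zip(w_rec[j], h)))
             for j in range(hidden)]
        states.append(h)
    return states


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, growth, kernel, hidden = 64, 4, 3, 8   # hypothetical sizes
dilations = [1, 2, 4]                     # hypothetical dilation pattern
noisy = [[rng.gauss(0.0, 1.0) for _ in range(T)]]  # one-channel waveform

# Input channels grow by `growth` per layer because of the dense concatenation.
layer_weights, channels = [], 1
for _ in dilations:
    layer_weights.append([rand_matrix(rng, channels, kernel) for _ in range(growth)])
    channels += growth

features = dense_block(noisy, layer_weights, dilations)
states = recurrent_stage(features, rand_matrix(rng, hidden, channels),
                         rand_matrix(rng, hidden, hidden))
print(len(features), len(states), len(states[0]))
```

In this toy, the convolutional stage gathers local context through dilated kernels, and the recurrence then carries information across the whole sequence, which is one way to read "dual-staged" aggregation.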
Taxonomy
Topics: Speech and Audio Processing · Indoor and Outdoor Localization Technologies · Advanced Adaptive Filtering Techniques
