TL;DR
This paper introduces CMGAN, a conformer-based metric GAN for speech enhancement that models local and global dependencies in the time-frequency domain, outperforming previous models with a PESQ of 3.41 and an SSNR of 11.10 dB on Voice Bank+DEMAND.
Contribution
The paper presents a novel conformer-based generator and a metric discriminator for speech enhancement, improving speech quality metrics over prior methods.
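A minimal sketch of how a metric discriminator can steer the generator toward an evaluation score. The least-squares form below is a common formulation of metric-based adversarial training and is assumed here, not taken from this summary; the function names and the score range used for normalization are hypothetical.

```python
import numpy as np

def normalize_score(raw_score, lo=1.0, hi=4.5):
    """Map a raw quality score to [0, 1]. The (lo, hi) range is an assumption."""
    return float(np.clip((raw_score - lo) / (hi - lo), 0.0, 1.0))

def discriminator_loss(d_clean_clean, d_clean_enhanced, target_score):
    """Least-squares discriminator loss (assumed form).

    The discriminator should output 1 for (clean, clean) pairs and the
    normalized metric score for (clean, enhanced) pairs.
    """
    d_clean_clean = np.asarray(d_clean_clean, dtype=float)
    d_clean_enhanced = np.asarray(d_clean_enhanced, dtype=float)
    target_score = np.asarray(target_score, dtype=float)
    return float(np.mean((d_clean_clean - 1.0) ** 2)
                 + np.mean((d_clean_enhanced - target_score) ** 2))

def generator_adversarial_loss(d_clean_enhanced):
    """The generator is rewarded when the discriminator predicts the top score."""
    d_clean_enhanced = np.asarray(d_clean_enhanced, dtype=float)
    return float(np.mean((d_clean_enhanced - 1.0) ** 2))
```

In this setup the discriminator learns to imitate the evaluation metric, so its output gives the generator a differentiable training signal even when the metric itself is not differentiable.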
Findings
Achieves a PESQ of 3.41 and an SSNR of 11.10 dB on the Voice Bank+DEMAND dataset.
Outperforms previous speech enhancement models in quantitative evaluations.
Uses two-stage conformer blocks that model both time and frequency dependencies of the magnitude and complex spectrograms.
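The two-stage idea can be illustrated with axis reshaping alone. The sketch below assumes a feature layout of (batch, channels, time, freq) and uses a hypothetical `sequence_model` callable as a stand-in for a conformer block; neither detail comes from this summary.

```python
import numpy as np

def apply_along_time(features, sequence_model):
    """Treat each frequency bin as an independent sequence over time."""
    b, c, t, f = features.shape
    x = features.transpose(0, 3, 2, 1).reshape(b * f, t, c)  # (B*F, T, C)
    x = sequence_model(x)
    return x.reshape(b, f, t, c).transpose(0, 3, 2, 1)       # (B, C, T, F)

def apply_along_freq(features, sequence_model):
    """Treat each time frame as an independent sequence over frequency."""
    b, c, t, f = features.shape
    x = features.transpose(0, 2, 3, 1).reshape(b * t, f, c)  # (B*T, F, C)
    x = sequence_model(x)
    return x.reshape(b, t, f, c).transpose(0, 3, 1, 2)       # (B, C, T, F)

def two_stage_block(features, time_model, freq_model):
    """Run the time stage first, then the frequency stage."""
    return apply_along_freq(apply_along_time(features, time_model), freq_model)

def mean_mixer(x):
    """Placeholder for a conformer block: mixes information along the sequence axis."""
    return x + x.mean(axis=1, keepdims=True)
```

Calling `two_stage_block(x, mean_mixer, mean_mixer)` on an array of shape (2, 3, 4, 5) returns an array of the same shape in which every element has received information from both its time row and its frequency column.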
Abstract
Recently, convolution-augmented transformer (Conformer) has achieved promising performance in automatic speech recognition (ASR) and time-domain speech enhancement (SE), as it can capture both local and global dependencies in the speech signal. In this paper, we propose a conformer-based metric generative adversarial network (CMGAN) for SE in the time-frequency (TF) domain. In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information by modeling both time and frequency dependencies. The estimation of magnitude and complex spectrogram is decoupled in the decoder stage and then jointly incorporated to reconstruct the enhanced speech. In addition, a metric discriminator is employed to further improve the quality of the enhanced estimated speech by optimizing the generator with respect to a corresponding evaluation score.…
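The abstract does not spell out how the decoupled magnitude and complex estimates are recombined. One plausible reading, sketched here as an assumption, is that a magnitude mask is applied to the noisy magnitude, paired with the noisy phase, and then corrected by a complex-valued residual. All names below are hypothetical.

```python
import numpy as np

def combine_estimates(noisy_spec, mask, complex_residual):
    """Combine a magnitude-mask estimate with a complex residual (assumed scheme).

    noisy_spec:       complex array (freq, time), STFT of the noisy input
    mask:             real non-negative array, same shape
    complex_residual: complex array, same shape
    """
    noisy_mag = np.abs(noisy_spec)
    noisy_phase = np.angle(noisy_spec)
    # Masked magnitude keeps the noisy phase.
    masked = mask * noisy_mag * np.exp(1j * noisy_phase)
    # The complex branch then refines both magnitude and phase.
    return masked + complex_residual
```

With an all-ones mask and a zero residual this returns the noisy spectrogram unchanged, which makes the role of each branch easy to inspect in isolation.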
