Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions
Wangyou Zhang, Jing Shi, Chenda Li, Shinji Watanabe, Yanmin Qian

TL;DR
This paper investigates the generalization of multi-channel Conv-TasNet speech enhancement from simulated to real-world data, proposing methods to close the performance gap and improve real-world speech recognition accuracy.
Contribution
It introduces strategies to adapt multi-channel Conv-TasNet for real data, including integration with beamforming and joint training with speech recognition models.
Findings
The proposed methods significantly reduce the ASR performance gap between simulated and real data.
They improve speech recognition accuracy on CHiME-4.
They maintain strong speech enhancement capability in real-world scenarios.
Abstract
Deep learning-based time-domain models, e.g., Conv-TasNet, have shown great potential in both single-channel and multi-channel speech enhancement. However, many experiments on time-domain speech enhancement models are conducted under simulated conditions, and it is not well studied whether their good performance generalizes to real-world scenarios. In this paper, we aim to provide an insightful investigation of applying multi-channel Conv-TasNet based speech enhancement to both simulated and real data. Our preliminary experiments show a large gap between the two conditions in terms of ASR performance. Several approaches are applied to close this gap, including the integration of multi-channel Conv-TasNet into the beamforming model with various strategies, and the joint training of speech enhancement and speech recognition models. Our experiments on the CHiME-4 corpus…
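To make the idea of feeding a neural enhancement front-end into a beamformer concrete, here is a minimal sketch of one common way to do it. It is not necessarily any of the strategies used in the paper. It assumes STFT-domain signals for a single frequency bin, treats the network's multi-channel speech estimate as given, and uses the standard reference-channel (Souden-style) MVDR formulation. All function names (`covariance`, `mvdr_weights`, `beamform_bin`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def covariance(stft_bin):
    """Sample spatial covariance of one frequency bin.

    stft_bin: complex array of shape (channels, frames).
    Returns a (channels, channels) Hermitian matrix.
    """
    return stft_bin @ stft_bin.conj().T / stft_bin.shape[1]


def mvdr_weights(speech_cov, noise_cov, ref_channel=0, eps=1e-8):
    """Reference-channel MVDR weights for one frequency bin.

    Computes w = (Phi_n^{-1} Phi_s) u / trace(Phi_n^{-1} Phi_s),
    where u selects the reference channel.
    """
    num_channels = noise_cov.shape[0]
    # Small diagonal loading keeps the noise covariance invertible.
    loaded_noise_cov = noise_cov + eps * np.eye(num_channels)
    ratio = np.linalg.solve(loaded_noise_cov, speech_cov)
    return ratio[:, ref_channel] / (np.trace(ratio) + eps)


def beamform_bin(mixture_bin, speech_est_bin, ref_channel=0):
    """Apply MVDR to a mixture using an external speech estimate.

    mixture_bin, speech_est_bin: complex arrays of shape (channels, frames).
    The speech estimate stands in for the output of an enhancement network
    (an assumption of this sketch). The noise estimate is the residual.
    """
    noise_est_bin = mixture_bin - speech_est_bin
    weights = mvdr_weights(
        covariance(speech_est_bin), covariance(noise_est_bin), ref_channel
    )
    # Beamformer output y(t) = w^H x(t), shape (frames,).
    return weights.conj() @ mixture_bin
```

In this sketch the beamformer, rather than the network, produces the final signal: the network's estimate is used only to compute spatial statistics, and the output is a linear filter applied to the observed mixture.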
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Gait Recognition and Analysis
