Time-Domain Multi-modal Bone/air Conducted Speech Enhancement
Cheng Yu, Kuo-Hsuan Hung, Syu-Siang Wang, Szu-Wei Fu, Yu Tsao, Jeih-weih Hung

TL;DR
This paper introduces a time-domain multi-modal speech enhancement (SE) system that combines bone- and air-conducted signals through a fully convolutional network and ensemble-based fusion, and reports improved performance over single-source methods.
Contribution
It presents a novel time-domain multi-modal SE framework that uses bone- and air-conducted signals, together with two ensemble-learning-based fusion strategies: early fusion (EF) and late fusion (LF).
Findings
Multi-modal SE outperforms single-source SE in various metrics.
Late fusion strategy yields better results than early fusion.
The proposed method improves speech quality in Mandarin corpus experiments.
Abstract
Previous studies have proven that integrating video signals, as a complementary modality, can facilitate improved performance for speech enhancement (SE). However, video clips usually contain large amounts of data and pose a high cost in terms of computational resources and thus may complicate the SE system. As an alternative source, a bone-conducted speech signal has a moderate data size while manifesting speech-phoneme structures, and thus complements its air-conducted counterpart. In this study, we propose a novel multi-modal SE structure in the time domain that leverages bone- and air-conducted signals. In addition, we examine two ensemble-learning-based strategies, early fusion (EF) and late fusion (LF), to integrate the two types of speech signals, and adopt a deep learning-based fully convolutional network to conduct the enhancement. The experiment results on the Mandarin corpus…
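The difference between the two fusion strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' model: a single linear 1-D convolution stands in for the fully convolutional network, the LF fusion stage is assumed to be another convolution over the two branch outputs, and all names (`conv1d`, `early_fusion`, `late_fusion`) and kernel values are hypothetical.

```python
import random


def conv1d(channels, kernels, bias=0.0):
    """Zero-padded, length-preserving 1-D convolution with one output channel.

    Computed as cross-correlation, as is common in deep learning frameworks.
    channels: list of equal-length sample lists, one per input channel
    kernels:  list of odd-length kernels, one per input channel
    """
    n = len(channels[0])
    half = len(kernels[0]) // 2
    out = []
    for t in range(n):
        acc = bias
        for x, k in zip(channels, kernels):
            for j, w in enumerate(k):
                i = t + j - half
                if 0 <= i < n:  # samples outside the signal count as zero
                    acc += w * x[i]
        out.append(acc)
    return out


def early_fusion(air, bone, kernels):
    # EF (assumed form): stack both waveforms as input channels of ONE model.
    return conv1d([air, bone], kernels)


def late_fusion(air, bone, k_air, k_bone, k_fuse):
    # LF (assumed form): each waveform gets its own model,
    # then a fusion stage combines the two outputs.
    y_air = conv1d([air], [k_air])
    y_bone = conv1d([bone], [k_bone])
    return conv1d([y_air, y_bone], k_fuse)


# Toy waveforms standing in for time-aligned air- and bone-conducted recordings.
random.seed(0)
air = [random.uniform(-1, 1) for _ in range(16)]
bone = [random.uniform(-1, 1) for _ in range(16)]

smooth = [0.25, 0.5, 0.25]   # arbitrary illustrative kernel
passthrough = [0.0, 1.0, 0.0]

ef_out = early_fusion(air, bone, [smooth, smooth])
lf_out = late_fusion(air, bone, smooth, smooth, [passthrough, passthrough])
```

In this sketch, EF mixes the two signals at the input so a single model sees both channels from the first layer, while LF keeps the two branches separate until the final fusion stage. Both operate directly on waveform samples, matching the time-domain setting described above.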
