Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement
Meng-Ping Lin, Jen-Cheng Hou, Chia-Wei Chen, Shao-Yi Chien, Jun-Cheng Chen, Xugang Lu, and Yu Tsao

TL;DR
This paper introduces DLAV-SE, a diffusion-based multi-modal speech enhancement framework that integrates audio, visual, and linguistic data, significantly improving speech quality by effectively bridging modality gaps through cross-modal knowledge transfer.
Contribution
The paper proposes a novel diffusion-based AVSE model that incorporates linguistic knowledge from a pretrained language model through a cross-modal transfer mechanism; this knowledge is embedded into the model during training.
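The summary does not specify how the cross-modal transfer is formulated, so the following is only a minimal sketch of one common way such a training-time term could look, not the authors' method. It assumes the audio-visual features have already been projected to the same dimension as the language-model embeddings and are aligned frame by frame. The names `transfer_loss`, `total_loss`, and the `weight` parameter are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon guards against zero vectors

def transfer_loss(av_feats, plm_feats):
    """Hypothetical alignment term: mean (1 - cosine) over aligned frames."""
    assert len(av_feats) == len(plm_feats), "frames must be aligned one-to-one"
    distances = [1.0 - cosine_similarity(a, p) for a, p in zip(av_feats, plm_feats)]
    return sum(distances) / len(distances)

def total_loss(denoise_loss, av_feats, plm_feats, weight=0.1):
    """Hypothetical objective: a denoising loss plus a weighted alignment term."""
    return denoise_loss + weight * transfer_loss(av_feats, plm_feats)
```

In this sketch the alignment term only shapes the audio-visual features while the model is being trained; whether and how DLAV-SE uses the language model outside training is not stated in this summary.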
Findings
DLAV-SE outperforms state-of-the-art methods in speech quality.
The approach reduces phonetic confusion and artifacts.
Visualization confirms improved output quality.
Abstract
Speech enhancement (SE) aims to improve the quality and intelligibility of speech in noisy environments. Recent studies have shown that incorporating visual cues in audio signal processing can enhance SE performance. Given that human speech communication naturally involves audio, visual, and linguistic modalities, it is reasonable to expect additional improvements by integrating linguistic information. However, effectively bridging these modality gaps, particularly during knowledge transfer, remains a significant challenge. In this paper, we propose a novel multi-modal learning framework, termed DLAV-SE, which leverages a diffusion-based model integrating audio, visual, and linguistic information for audio-visual speech enhancement (AVSE). Within this framework, the linguistic modality is modeled using a pretrained language model (PLM), which transfers linguistic knowledge to the…
Taxonomy
Topics: Speech and Audio Processing
Methods: Diffusion
