Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement

Meng-Ping Lin, Jen-Cheng Hou, Chia-Wei Chen, Shao-Yi Chien, Jun-Cheng Chen, Xugang Lu, and Yu Tsao

arXiv:2501.13375 · cs.SD · May 27, 2025

Open Access

TL;DR

This paper introduces DLAV-SE, a diffusion-based multi-modal speech enhancement framework that integrates audio, visual, and linguistic data, significantly improving speech quality by effectively bridging modality gaps through cross-modal knowledge transfer.

Contribution

The paper proposes a novel diffusion-based AVSE model that draws linguistic knowledge from a pretrained language model. A cross-modal transfer mechanism embeds this knowledge into the model during training.
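As a rough illustration of what transferring knowledge across modalities during training can look like, the sketch below adds an alignment term to a primary training loss. It is not the paper's method: the cosine-distance alignment, the function names `cosine_distance` and `training_loss`, and the `transfer_weight` parameter are all assumptions made for this example, and the summary does not state how DLAV-SE actually formulates the transfer.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def training_loss(enhancement_loss, av_features, plm_embeddings, transfer_weight=0.1):
    """Hypothetical combined objective (an assumption, not the paper's loss).

    enhancement_loss: scalar loss of the enhancement model for one batch.
    av_features:      list of audio-visual feature vectors, one per frame.
    plm_embeddings:   list of language-model embeddings aligned to those frames.
    transfer_weight:  illustrative weight of the alignment term.
    """
    if not av_features or len(av_features) != len(plm_embeddings):
        raise ValueError("feature lists must be non-empty and the same length")
    # Mean cosine distance pulls audio-visual features toward the linguistic ones.
    alignment = sum(
        cosine_distance(a, p) for a, p in zip(av_features, plm_embeddings)
    ) / len(av_features)
    return enhancement_loss + transfer_weight * alignment
```

In a setup like this, the alignment term only shapes the audio-visual features while the model is being trained, which matches the summary's statement that the linguistic knowledge is embedded during training.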

Findings

01. DLAV-SE outperforms state-of-the-art methods in speech quality.

02. The approach reduces phonetic confusion and artifacts.

03. Visualization confirms improved output quality.

Abstract

Speech enhancement (SE) aims to improve the quality and intelligibility of speech in noisy environments. Recent studies have shown that incorporating visual cues in audio signal processing can enhance SE performance. Given that human speech communication naturally involves audio, visual, and linguistic modalities, it is reasonable to expect additional improvements by integrating linguistic information. However, effectively bridging these modality gaps, particularly during knowledge transfer, remains a significant challenge. In this paper, we propose a novel multi-modal learning framework, termed DLAV-SE, which leverages a diffusion-based model integrating audio, visual, and linguistic information for audio-visual speech enhancement (AVSE). Within this framework, the linguistic modality is modeled using a pretrained language model (PLM), which transfers linguistic knowledge to the…

Taxonomy

Topics: Speech and Audio Processing

Methods: Diffusion