Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

Jiarong Du; Zhan Jin; Peijun Yang; Juan Liu; Zhuo Li; Xin Liu; Ming Li

arXiv:2510.26825·cs.SD·November 3, 2025

Open Access

TL;DR

This paper introduces an audio-visual speech enhancement (AVSE) system for complex acoustic environments with interfering sounds and reverberation. It uses a "separation before dereverberation" pipeline, outperforms previous methods, and took first place in the human subjective listening test of the AVSEC-4 challenge.

Contribution

The paper proposes an AVSE approach built on a "separation before dereverberation" pipeline. The pipeline improves speech quality in complex acoustic scenarios and can be extended to other AVSE networks.

Findings

01 Achieved top performance in objective metrics on AVSEC-4

02 Secured first place in human subjective listening test

03 Demonstrated robustness in complex acoustic environments

Abstract

Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker's speech from mixed audio. In real-world scenarios, there often exist complex acoustic environments, accompanied by various interfering sounds and reverberation. Most previous methods struggle to cope with such complex conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated the performance of our system in AVSEC-4: we achieved excellent results in the three objective metrics…
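The "separation before dereverberation" ordering described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' system: the names build_pipeline, toy_separate, and toy_dereverberate are hypothetical, the toy stages stand in for the paper's networks, and the abstract does not say whether the dereverberation stage also receives visual input (here it takes audio only). The only point shown is that the stages are composed in a fixed order and that either one can be swapped for another implementation.

```python
from typing import Callable, List, Sequence

Audio = List[float]
# Hypothetical per-sample visual cue, e.g. lip activity in [0, 1].
Visual = Sequence[float]

SeparationStage = Callable[[Audio, Visual], Audio]
DereverbStage = Callable[[Audio], Audio]


def build_pipeline(
    separate: SeparationStage, dereverberate: DereverbStage
) -> Callable[[Audio, Visual], Audio]:
    """Compose two stages in "separation before dereverberation" order."""

    def enhance(mixture: Audio, visual: Visual) -> Audio:
        # Stage 1: extract the (still reverberant) target speaker.
        target_reverberant = separate(mixture, visual)
        # Stage 2: remove reverberation from the separated signal.
        return dereverberate(target_reverberant)

    return enhance


def toy_separate(mixture: Audio, visual: Visual) -> Audio:
    """Toy stand-in: gate each sample by the visual activity cue."""
    return [x * v for x, v in zip(mixture, visual)]


def toy_dereverberate(audio: Audio, delay: int = 2, gain: float = 0.5) -> Audio:
    """Toy stand-in: invert a single echo x[n] = s[n] + gain * s[n - delay]."""
    out: Audio = []
    for n, x in enumerate(audio):
        echo = gain * out[n - delay] if n >= delay else 0.0
        out.append(x - echo)
    return out


enhance = build_pipeline(toy_separate, toy_dereverberate)

# A unit impulse followed by its echo two samples later, with the
# target speaker visually active throughout.
reverberant_mixture = [1.0, 0.0, 0.5, 0.0, 0.0]
always_active = [1.0] * 5
enhanced = enhance(reverberant_mixture, always_active)
```

Because build_pipeline only depends on the two callables' signatures, a different separation network could be dropped in without touching the dereverberation stage, which is one way to read the abstract's remark that the pipeline can be extended to other AVSE networks.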

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Advanced Adaptive Filtering Techniques