Deep Complex U-Net with Conformer for Audio-Visual Speech Enhancement

Shafique Ahmed; Chia-Wei Chen; Wenze Ren; Chin-Jou Li; Ernie Chu; Jun-Cheng Chen; Amir Hussain; Hsin-Min Wang; Yu Tsao; and Jen-Cheng Hou

arXiv:2309.11059 · eess.AS · October 10, 2023

TL;DR

This paper introduces DCUC-Net, an audio-visual speech enhancement model that combines complex-domain features with conformer blocks to fuse audio and visual information; it outperforms the baseline model by 0.14 PESQ points.

Contribution

The paper presents a new deep complex U-Net with conformer architecture for audio-visual speech enhancement, integrating complex features and self-attention for improved performance.

Findings

01. Outperforms the baseline model by 0.14 PESQ points.

02. Performs comparably to state-of-the-art models.

03. Outperforms all other compared models on the TMSV dataset.

Abstract

Recent studies have increasingly acknowledged the advantages of incorporating visual data into speech enhancement (SE) systems. In this paper, we introduce a novel audio-visual SE approach, termed DCUC-Net (deep complex U-Net with conformer network). The proposed DCUC-Net leverages complex domain features and a stack of conformer blocks. The encoder and decoder of DCUC-Net are designed using a complex U-Net-based framework. The audio and visual signals are processed using a complex encoder and a ResNet-18 model, respectively. These processed signals are then fused using the conformer blocks and transformed into enhanced speech waveforms via a complex decoder. The conformer blocks consist of a combination of self-attention mechanisms and convolutional operations, enabling DCUC-Net to effectively capture both global and local audio-visual dependencies. Our experimental results demonstrate…
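The "complex domain features" mentioned in the abstract imply layers that operate on real and imaginary parts jointly. As a minimal sketch, and not the authors' implementation, the following shows how a complex-valued 1-D convolution can be assembled from four real-valued convolutions, which is the usual building block of deep complex networks. The function names `real_conv1d` and `complex_conv1d` and the toy input values are illustrative assumptions, not taken from the paper.

```python
def real_conv1d(x, w):
    """Valid-mode 1-D cross-correlation of two real-valued sequences."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n_out)]


def complex_conv1d(x_re, x_im, w_re, w_im):
    """Complex convolution built from four real convolutions.

    Uses the identity
    (w_re + j*w_im) * (x_re + j*x_im)
        = (w_re*x_re - w_im*x_im) + j*(w_re*x_im + w_im*x_re),
    where * denotes the real-valued convolution above.
    """
    rr = real_conv1d(x_re, w_re)  # real kernel on real input
    ii = real_conv1d(x_im, w_im)  # imaginary kernel on imaginary input
    ri = real_conv1d(x_im, w_re)  # real kernel on imaginary input
    ir = real_conv1d(x_re, w_im)  # imaginary kernel on real input
    out_re = [a - b for a, b in zip(rr, ii)]
    out_im = [a + b for a, b in zip(ri, ir)]
    return out_re, out_im


# Toy input (hypothetical values): one frequency bin over four time frames.
x_re = [1.0, 2.0, 3.0, 4.0]
x_im = [0.5, -1.0, 0.0, 2.0]
# Toy complex kernel of length two.
w_re = [1.0, -1.0]
w_im = [0.5, 2.0]

out_re, out_im = complex_conv1d(x_re, x_im, w_re, w_im)
```

In a full model, the real-valued convolutions would be learned 2-D layers applied to spectrogram feature maps; the four-way decomposition stays the same.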

Taxonomy

Topics: Speech and Audio Processing · Subtitles and Audiovisual Media · Hearing Loss and Rehabilitation

Methods: Max Pooling · Convolution · Concatenated Skip Connection · U-Net