Deep Complex U-Net with Conformer for Audio-Visual Speech Enhancement
Shafique Ahmed, Chia-Wei Chen, Wenze Ren, Chin-Jou Li, Ernie Chu, Jun-Cheng Chen, Amir Hussain, Hsin-Min Wang, Yu Tsao, and Jen-Cheng Hou

TL;DR
This paper introduces DCUC-Net, an audio-visual speech enhancement model that combines complex-domain features with conformer blocks to make effective use of visual data. It outperforms the baseline model in PESQ and performs comparably to state-of-the-art models.
Contribution
The paper presents a new deep complex U-Net with conformer architecture for audio-visual speech enhancement, integrating complex features and self-attention for improved performance.
Findings
Outperforms the baseline model by 0.14 PESQ points.
Performs comparably to state-of-the-art models.
Outperforms all compared models on the TMSV dataset.
Abstract
Recent studies have increasingly acknowledged the advantages of incorporating visual data into speech enhancement (SE) systems. In this paper, we introduce a novel audio-visual SE approach, termed DCUC-Net (deep complex U-Net with conformer network). The proposed DCUC-Net leverages complex domain features and a stack of conformer blocks. The encoder and decoder of DCUC-Net are designed using a complex U-Net-based framework. The audio and visual signals are processed using a complex encoder and a ResNet-18 model, respectively. These processed signals are then fused using the conformer blocks and transformed into enhanced speech waveforms via a complex decoder. The conformer blocks consist of a combination of self-attention mechanisms and convolutional operations, enabling DCUC-Net to effectively capture both global and local audio-visual dependencies. Our experimental results demonstrate…
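The abstract refers to complex-domain features and a complex encoder and decoder but does not spell out the layer arithmetic. As a rough illustration only, the sketch below shows how a complex-valued convolution is commonly built from four real-valued convolutions. The function names `real_conv1d` and `complex_conv1d`, the 1-D setting, and the toy values are assumptions made for this example, not details taken from DCUC-Net.

```python
# Minimal sketch of a complex-valued 1-D convolution.
# All names and values here are hypothetical; the paper's actual layers
# are not specified in this summary.


def real_conv1d(x, w):
    """Valid-mode 1-D cross-correlation of two real sequences."""
    n, k = len(x), len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(n - k + 1)]


def complex_conv1d(x_re, x_im, w_re, w_im):
    """Complex convolution expressed with four real convolutions.

    (w_re + i*w_im) * (x_re + i*x_im)
        = (w_re*x_re - w_im*x_im) + i*(w_re*x_im + w_im*x_re)
    """
    rr = real_conv1d(x_re, w_re)  # real kernel on real input
    ii = real_conv1d(x_im, w_im)  # imaginary kernel on imaginary input
    ri = real_conv1d(x_im, w_re)  # real kernel on imaginary input
    ir = real_conv1d(x_re, w_im)  # imaginary kernel on real input
    out_re = [a - b for a, b in zip(rr, ii)]
    out_im = [a + b for a, b in zip(ri, ir)]
    return out_re, out_im


# Toy complex input, e.g. one frequency bin of a spectrogram over 4 frames.
x = [1 + 2j, 0 - 1j, 3 + 0j, -2 + 1j]
# Toy complex kernel of width 2.
w = [0.5 - 1j, 2 + 0.5j]

out_re, out_im = complex_conv1d(
    [v.real for v in x], [v.imag for v in x],
    [v.real for v in w], [v.imag for v in w],
)

# Reference result using Python's built-in complex arithmetic.
reference = [
    sum(w[j] * x[i + j] for j in range(len(w)))
    for i in range(len(x) - len(w) + 1)
]
```

Keeping the real and imaginary parts as separate real-valued channels is what lets such a layer be assembled from ordinary real convolutions. The `reference` list recomputes the same result directly with complex numbers as a sanity check.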
Taxonomy
Topics: Speech and Audio Processing · Subtitles and Audiovisual Media · Hearing Loss and Rehabilitation
Methods: Max Pooling · Convolution · Concatenated Skip Connection · U-Net
