FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching

Chaeyoung Jung; Suyeon Lee; Ji-Hoon Kim; Joon Son Chung

arXiv:2406.09286 · eess.AS · June 14, 2024

TL;DR

FlowAVSE is an audio-visual speech enhancement method based on conditional flow matching. It reports 22 times faster inference and half the model size while maintaining output quality.

Contribution

The paper applies a conditional flow matching algorithm that generates high-quality speech in a single sampling step, and it further improves efficiency by optimizing the U-Net architecture that underlies diffusion-based systems.

Findings

1. 22 times faster inference speed
2. Model size reduced by half
3. Maintains high output quality

Abstract

This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues. While existing diffusion-based approaches have demonstrated remarkable quality, their applicability is limited by slow inference speeds and computational complexity. To address this issue, we present FlowAVSE, which enhances the inference speed and reduces the number of learnable parameters without degrading the output quality. In particular, we employ a conditional flow matching algorithm that enables the generation of high-quality speech in a single sampling step. Moreover, we increase efficiency by optimizing the underlying U-net architecture of diffusion-based systems. Our experiments demonstrate that FlowAVSE achieves 22 times faster inference speed and reduces the model size by half while maintaining the output quality. The demo page is available…
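To make the "single sampling step" idea concrete, here is a minimal toy sketch of generic conditional flow matching in plain Python. It is not the FlowAVSE implementation: the straight-line probability path, the function names (`interpolate`, `target_velocity`, `cfm_loss`, `euler_sample`), and the `oracle_velocity` stand-in for a trained, visually conditioned network are all assumptions made for illustration, since this page does not specify the paper's path or conditioning.

```python
import random


def interpolate(x0, x1, t):
    """Point on the assumed straight-line path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path: d x_t / dt = x1 - x0 (constant in t)."""
    return [b - a for a, b in zip(x0, x1)]


def cfm_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, num_steps=1):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with num_steps Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy data: a hypothetical "clean" vector and a corrupted starting point.
rng = random.Random(0)
clean = [0.5, -0.25, 1.0, 0.0]
start = [c + rng.gauss(0.0, 0.3) for c in clean]


def oracle_velocity(x, t):
    """Hypothetical stand-in for a trained network.

    On the straight-line path, x1 - x0 equals (x1 - x_t) / (1 - t),
    so this oracle returns the exact target velocity for t < 1.
    """
    return [(c - xi) / (1.0 - t) for c, xi in zip(clean, x)]


# One Euler step covers the whole interval [0, 1].
estimate = euler_sample(oracle_velocity, start, num_steps=1)
loss = cfm_loss(oracle_velocity, start, clean, t=0.3)
```

Because the assumed path is a straight line, its velocity does not change with time, so a single Euler step from t = 0 to t = 1 already lands on the target when the velocity estimate is accurate. A learned model would only approximate this field, and how closely one step matches many steps then depends on how well training straightens the learned trajectories.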

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Advanced Data Compression Techniques

Methods: Concatenated Skip Connection · Convolution · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Max Pooling · U-Net