FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung

TL;DR
FlowAVSE introduces a fast, efficient audio-visual speech enhancement method using conditional flow matching, significantly improving inference speed and reducing model size without sacrificing quality.
Contribution
It presents a novel conditional flow matching approach that enables high-quality speech generation in a single step, optimizing the U-net architecture for efficiency.
Findings
22 times faster inference speed
Model size reduced by half
Maintains high output quality
Abstract
This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues. While existing diffusion-based approaches have demonstrated remarkable quality, their applicability is limited by slow inference speeds and computational complexity. To address this issue, we present FlowAVSE which enhances the inference speed and reduces the number of learnable parameters without degrading the output quality. In particular, we employ a conditional flow matching algorithm that enables the generation of high-quality speech in a single sampling step. Moreover, we increase efficiency by optimizing the underlying U-net architecture of diffusion-based systems. Our experiments demonstrate that FlowAVSE achieves 22 times faster inference speed and reduces the model size by half while maintaining the output quality. The demo page is available…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Advanced Adaptive Filtering Techniques · Advanced Data Compression Techniques
Methods*Communicated@Fast*How Do I Communicate to Expedia? · Concatenated Skip Connection · Convolution · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Max Pooling · U-Net
