A Fast and Lightweight Model for Causal Audio-Visual Speech Separation

Wendi Sang; Kai Li; Runxuan Yang; Jianqiang Huang; Xiaolin Hu

arXiv:2506.06689 · cs.SD · October 15, 2025

TL;DR

Swift-Net is a novel, lightweight, causal audio-visual speech separation model designed for real-time applications, effectively integrating visual cues and historical information to improve speech separation performance in complex environments.

Contribution

The paper introduces Swift-Net, a causal, lightweight AVSS model with a new fusion module and Grouped SRUs, enabling real-time speech separation with improved efficiency and performance.
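The idea of grouped recurrent units can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified form of the SRU recurrence (forget gate, reset gate, highway connection) and applies one independent cell to each channel group, processing frames one at a time so that no future context is used. The names `SRUCell`, `GroupedSRU`, and `run_stream`, the random weights, and the group sizes are all hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class SRUCell:
    """Simplified SRU-style recurrence for one channel group (assumed form)."""

    def __init__(self, dim, rng):
        def mat():
            return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

        self.W, self.Wf, self.Wr = mat(), mat(), mat()
        self.c = [0.0] * dim  # cell state: the only carrier of history

    def step(self, x):
        cand = matvec(self.W, x)
        f = [sigmoid(z) for z in matvec(self.Wf, x)]  # forget gate
        r = [sigmoid(z) for z in matvec(self.Wr, x)]  # reset gate
        # Blend previous state with the current candidate.
        self.c = [fi * ci + (1 - fi) * ui for fi, ci, ui in zip(f, self.c, cand)]
        # Highway connection mixes the state with the raw input.
        return [ri * math.tanh(ci) + (1 - ri) * xi
                for ri, ci, xi in zip(r, self.c, x)]


class GroupedSRU:
    """Splits the channels into groups and runs one cell per group (illustrative)."""

    def __init__(self, channels, groups, seed=0):
        assert channels % groups == 0
        rng = random.Random(seed)
        self.size = channels // groups
        self.cells = [SRUCell(self.size, rng) for _ in range(groups)]

    def step(self, frame):
        out = []
        for g, cell in enumerate(self.cells):
            out.extend(cell.step(frame[g * self.size:(g + 1) * self.size]))
        return out


def run_stream(channels, groups, frames, seed=0):
    # Causal, frame-by-frame processing: output t depends only on frames 0..t.
    model = GroupedSRU(channels, groups, seed)
    return [model.step(f) for f in frames]
```

In this sketch each group keeps its own cell state, so history is tracked separately per feature subspace, and the per-group weight matrices are smaller than a single full-width matrix would be. How Swift-Net actually partitions features and combines the groups is not specified in the summary above.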

Findings

01 Outperforms existing models on benchmark datasets

02 Operates effectively in real-time scenarios

03 Demonstrates robustness in complex environments

Abstract

Audio-visual speech separation (AVSS) aims to extract a target speech signal from a mixed signal by leveraging both auditory and visual (lip movement) cues. However, most existing AVSS methods exhibit complex architectures and rely on future context, operating offline, which renders them unsuitable for real-time applications. Inspired by the pipeline of RTFSNet, we propose a novel streaming AVSS model, named Swift-Net, which enhances the causal processing capabilities required for real-time applications. Swift-Net adopts a lightweight visual feature extraction module and an efficient fusion module for audio-visual integration. Additionally, Swift-Net employs Grouped SRUs to integrate historical information across different feature spaces, thereby improving the utilization efficiency of historical information. We further propose a causal transformation template to facilitate the…

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Advanced Adaptive Filtering Techniques