RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
Samuel Pegg, Kai Li, Xiaolin Hu

TL;DR
RTFS-Net introduces a novel time-frequency domain approach for audio-visual speech separation, leveraging RNNs and attention-based fusion to outperform existing methods in speed and quality with fewer parameters.
Contribution
The paper presents RTFS-Net, a time-frequency domain model that runs RNNs along the time and frequency dimensions of the audio and uses attention-based fusion to combine the audio and visual streams efficiently.
Findings
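Outperforms the prior SOTA method in both separation quality and inference speed
Reduces the number of model parameters by 90% relative to the prior SOTA method
Reduces computational cost (MACs) by 83% relative to the prior SOTA method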
Abstract
Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based…
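The idea of working on complex time-frequency bins and modelling each axis separately can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `stft`, `scan`, `scan_frequency`, and `scan_time`, the frame length, and the hop size are all assumptions made for illustration, and the toy linear recurrence in `scan` merely stands in for the learned multi-layered RNNs the abstract describes.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Naive Short-Time Fourier Transform (hypothetical helper).

    Returns a list of frames; each frame is a list of complex bins
    covering the one-sided spectrum (frame_len // 2 + 1 bins).
    """
    # Periodic Hann window to taper each frame.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = [
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(n_bins)
        ]
        frames.append(bins)
    return frames


def scan(sequence, decay=0.5):
    """Toy recurrence h_t = decay * h_{t-1} + x_t.

    Assumption: a fixed linear recurrence used only as a placeholder
    for a trained RNN layer.
    """
    state = 0j
    out = []
    for x in sequence:
        state = decay * state + x
        out.append(state)
    return out


def scan_frequency(spec):
    """Run the recurrence across the frequency bins of every frame."""
    return [scan(frame) for frame in spec]


def scan_time(spec):
    """Run the recurrence across the frames of every frequency bin."""
    columns = [scan(list(col)) for col in zip(*spec)]
    # Transpose back to the original [frame][bin] layout.
    return [list(row) for row in zip(*columns)]


# A sine with two cycles per 8-sample frame, so its energy sits in bin 2.
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = stft(signal)
processed = scan_time(scan_frequency(spec))
```

Each pass treats one axis as the sequence dimension and the other as a batch of independent sequences, so the output keeps the same frames-by-bins layout as the input grid.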
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Hearing Loss and Rehabilitation
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
