RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation

Samuel Pegg; Kai Li; Xiaolin Hu

arXiv:2309.17189 · cs.SD · March 22, 2024 · 1 citation

TL;DR

RTFS-Net introduces a novel time-frequency domain approach for audio-visual speech separation, leveraging RNNs and attention-based fusion to outperform existing methods in speed and quality with fewer parameters.

Contribution

The paper presents RTFS-Net, a time-frequency domain model for audio-visual speech separation that applies multi-layered RNNs along the time and frequency dimensions and fuses the audio and visual streams with an attention-based mechanism.

Findings

01 Outperforms prior SOTA in separation quality

02 Reduces model parameters by 90%

03 Decreases computational complexity by 83%

Abstract

Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based…
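The idea of modelling the time and frequency axes of an STFT independently can be illustrated with a small sketch. This is a toy under stated assumptions, not the authors' implementation: the function names (`stft`, `recurrent_scan`, `model_time_and_frequency`), the window and hop sizes, and the single-unit `tanh` recurrence standing in for a multi-layered RNN are all hypothetical, and it works on bin magnitudes rather than the complex bins the abstract refers to.

```python
import cmath
import math


def stft(signal, n_fft=8, hop=4):
    """Return a [time][frequency] grid of complex bins (n_fft // 2 + 1 per frame)."""
    # Periodic Hann window, applied to every frame before the DFT.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(n_fft)]
        bins = [
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ]
        frames.append(bins)
    return frames


def recurrent_scan(sequence, w_in=0.5, w_rec=0.5):
    """Toy one-unit recurrence h_i = tanh(w_in * x_i + w_rec * h_{i-1}).

    Assumption: stands in for a trained multi-layer RNN; weights are arbitrary.
    """
    h, out = 0.0, []
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
        out.append(h)
    return out


def model_time_and_frequency(grid):
    """Scan along frequency within each frame, then along time within each bin."""
    # Frequency pass: every time frame is its own sequence over frequency bins.
    freq_pass = [recurrent_scan(frame) for frame in grid]
    n_time, n_freq = len(freq_pass), len(freq_pass[0])
    # Time pass: every frequency bin is its own sequence over time frames.
    columns = [
        recurrent_scan([freq_pass[t][f] for t in range(n_time)])
        for f in range(n_freq)
    ]
    # Transpose back to [time][frequency].
    return [[columns[f][t] for f in range(n_freq)] for t in range(n_time)]


# A sine whose frequency falls exactly on DFT bin 2 for n_fft = 8.
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = stft(signal)
magnitudes = [[abs(b) for b in frame] for frame in spec]
features = model_time_and_frequency(magnitudes)
```

The output grid has the same [time][frequency] shape as the input, so passes like these can be stacked. A real model would replace the scalar recurrence with learned recurrent layers and carry several feature channels per bin.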

Code & Models

Repositories

spkgyk/RTFS-Net · PyTorch · Official

Videos

RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation · SlidesLive

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Hearing Loss and Rehabilitation

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings