SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

Hiroshi Sato; Takafumi Moriya; Masato Mimura; Shota Horiguchi; Tsubasa Ochiai; Takanori Ashihara; Atsushi Ando; Kentaro Shinayama; Marc Delcroix

arXiv:2407.01857 · eess.AS · July 3, 2024

Open Access

TL;DR

This paper presents SpeakerBeam-SS, a lightweight real-time target speaker extraction method using Conv-TasNet with state space modeling, significantly reducing computational cost while maintaining performance.

Contribution

It introduces a novel architecture combining state space modeling with Conv-TasNet, reducing complexity and enabling efficient real-time target speaker extraction.

Findings

01 Reduces the real-time factor by 78% compared to conventional methods.

02 Maintains performance despite reduced complexity.

03 Models long-term dependencies effectively with fewer layers.
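
To make the first finding concrete: the real-time factor (RTF) is the processing time divided by the duration of the audio processed, so an RTF below 1 means the system keeps up with the incoming stream. The sketch below computes an RTF and the relative reduction between two systems. The timing values are hypothetical and chosen only to illustrate a 78% reduction; they are not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def relative_reduction(baseline_rtf: float, new_rtf: float) -> float:
    """Fraction by which new_rtf is lower than baseline_rtf (0.78 means 78% lower)."""
    if baseline_rtf <= 0:
        raise ValueError("baseline_rtf must be positive")
    return 1.0 - new_rtf / baseline_rtf


# Hypothetical timings for 1 second of audio (illustration only, not paper results).
baseline_rtf = real_time_factor(processing_seconds=0.50, audio_seconds=1.0)
proposed_rtf = real_time_factor(processing_seconds=0.11, audio_seconds=1.0)
reduction = relative_reduction(baseline_rtf, proposed_rtf)
```

With these assumed numbers, the proposed system spends 22% of the baseline's processing time per second of audio, which corresponds to a 78% reduction.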

Abstract

Real-time target speaker extraction (TSE) is intended to extract the desired speaker's voice from the observed mixture of multiple speakers in a streaming manner. Implementing real-time TSE is challenging as the computational complexity must be reduced to provide real-time operation. This work introduces to Conv-TasNet-based TSE a new architecture based on state space modeling (SSM) that has been shown to model long-term dependency effectively. Owing to SSM, fewer dilated convolutional layers are required to capture temporal dependency in Conv-TasNet, resulting in the reduction of model complexity. We also enlarge the window length and shift of the convolutional (TasNet) frontend encoder to reduce the computational cost further; the performance decline is compensated by over-parameterization of the frontend encoder. The proposed method reduces the real-time factor by 78% from the…
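
As a rough illustration of why state space modeling suits streaming operation, the following is a minimal sketch of a generic diagonal linear state space recurrence. It is not the layer used in the paper, whose details are not given in this excerpt; the parameters `a`, `b`, `c` and all function names are assumptions made for the example. The point it shows is that a fixed-size state updated once per frame yields the same output as a causal convolution whose kernel spans the entire input history, which is how long-term context can be captured without stacking many dilated convolutional layers.

```python
from typing import List, Tuple


def ssm_step(state: List[float], u: float,
             a: List[float], b: List[float], c: List[float]) -> Tuple[List[float], float]:
    """One streaming step: x_t = a * x_{t-1} + b * u_t, y_t = sum(c * x_t)."""
    new_state = [a_i * x_i + b_i * u for a_i, x_i, b_i in zip(a, state, b)]
    y = sum(c_i * x_i for c_i, x_i in zip(c, new_state))
    return new_state, y


def ssm_stream(inputs: List[float],
               a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Process inputs frame by frame, keeping only a fixed-size state."""
    state = [0.0] * len(a)
    outputs = []
    for u in inputs:
        state, y = ssm_step(state, u, a, b, c)
        outputs.append(y)
    return outputs


def ssm_kernel(length: int,
               a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Equivalent convolution kernel: k[j] = sum_i c_i * a_i**j * b_i."""
    return [sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
            for j in range(length)]


def causal_conv(inputs: List[float], kernel: List[float]) -> List[float]:
    """Causal convolution: y_t = sum_{j=0..t} kernel[j] * inputs[t - j]."""
    return [sum(kernel[j] * inputs[t - j] for j in range(t + 1))
            for t in range(len(inputs))]


# Hypothetical parameters and input, chosen only for illustration.
a = [0.9, 0.5]
b = [1.0, 1.0]
c = [0.5, -0.3]
inputs = [1.0, 0.5, -0.25, 0.0, 2.0]

streamed = ssm_stream(inputs, a, b, c)
convolved = causal_conv(inputs, ssm_kernel(len(inputs), a, b, c))
```

Here `streamed` and `convolved` agree up to floating-point error, while the streaming form needs only `len(a)` stored values per step regardless of how long the input is.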

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing