SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling
Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa, Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, Marc Delcroix

TL;DR
This paper presents SpeakerBeam-SS, a lightweight real-time target speaker extraction method using Conv-TasNet with state space modeling, significantly reducing computational cost while maintaining performance.
Contribution
It introduces a novel architecture combining state space modeling with Conv-TasNet, reducing complexity and enabling efficient real-time target speaker extraction.
Findings
Reduces real-time factor by 78% compared to conventional methods.
Maintains performance despite reduced complexity.
Effective modeling of long-term dependencies with fewer layers.
Abstract
Real-time target speaker extraction (TSE) is intended to extract the desired speaker's voice from the observed mixture of multiple speakers in a streaming manner. Implementing real-time TSE is challenging as the computational complexity must be reduced to provide real-time operation. This work introduces to Conv-TasNet-based TSE a new architecture based on state space modeling (SSM) that has been shown to model long-term dependency effectively. Owing to SSM, fewer dilated convolutional layers are required to capture temporal dependency in Conv-TasNet, resulting in the reduction of model complexity. We also enlarge the window length and shift of the convolutional (TasNet) frontend encoder to reduce the computational cost further; the performance decline is compensated by over-parameterization of the frontend encoder. The proposed method reduces the real-time factor by 78% from the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
