DualSep: A Light-weight dual-encoder convolutional recurrent network for real-time in-car speech separation

Ziqian Wang; Jiayao Sun; Zihan Zhang; Xingchen Li; Jie Liu; Lei Xie

arXiv:2409.08610 · eess.AS · September 16, 2024

TL;DR

DualSep is a lightweight, real-time in-car speech separation system combining DSP and neural networks, utilizing dual encoders for spatial and spectral cues, achieving high performance with minimal computational resources.

Contribution

Introduces a novel dual-encoder convolutional recurrent network that effectively separates in-car speech in real-time with low computational cost.

Findings

1. Has only 0.83M parameters and runs at a real-time factor (RTF) of 0.39 on CPU.

2. Outperforms existing methods across various metrics.

3. Supports both streaming and non-streaming modes.
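
As a reading aid for the RTF figure above: the real-time factor is conventionally the processing time divided by the duration of the audio processed, so a value below 1 means the system keeps up with real time. The sketch below illustrates that definition only. The function names `real_time_factor` and `measure_rtf` are hypothetical, and this is not the authors' benchmarking setup.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process, samples, sample_rate):
    """Time one call of `process` on `samples` and return its RTF.

    `process` is any callable standing in for a separation model
    (an assumption made for illustration).
    """
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)
```

Under this definition, processing 10 seconds of audio in 3.9 seconds corresponds to an RTF of roughly 0.39.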

Abstract

Advancements in deep learning and voice-activated technologies have driven the development of human-vehicle interaction. Distributed microphone arrays are widely used in in-car scenarios because they can accurately capture the voices of passengers from different speech zones. However, the increase in the number of audio channels, coupled with the limited computational resources and low latency requirements of in-car systems, presents challenges for in-car multi-channel speech separation. To mitigate these problems, we propose a lightweight framework that cascades digital signal processing (DSP) and neural networks (NN). We utilize fixed beamforming (BF) to reduce computational costs and independent vector analysis (IVA) to provide a spatial prior. We employ dual encoders for dual-branch modeling, with the spatial encoder capturing spatial cues and the spectral encoder preserving spectral…
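
To make the term "fixed beamforming" concrete, here is a minimal sketch of a generic time-domain delay-and-sum beamformer with fixed integer sample delays. The excerpt does not specify which fixed beamformer design DualSep uses, so this is an assumed textbook example. The name `delay_and_sum` and the integer-delay simplification are illustrative choices, not the paper's method.

```python
def delay_and_sum(channels, delays):
    """Generic delay-and-sum beamformer (illustrative, not DualSep's design).

    channels: list of equal-length sample lists, one per microphone.
    delays:   non-negative integer arrival delay (in samples) of the
              target at each microphone, fixed in advance.
    Each channel is advanced by its delay so the target aligns across
    microphones, then the channels are averaged.
    """
    if len(channels) != len(delays):
        raise ValueError("need one delay per channel")
    n = len(channels[0])
    out = [0.0] * n
    for signal, delay in zip(channels, delays):
        for t in range(n - delay):
            out[t] += signal[t + delay]
    return [value / len(channels) for value in out]
```

Because the delays are fixed ahead of time, the operation costs only a few additions per sample, which matches the abstract's motivation of reducing computational cost before the neural network stage.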

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing