Lightweight Dual-channel Target Speaker Separation for Mobile Voice Communication

Yuanyuan Bao; Yanze Xu; Na Xu; Wenjing Yang; Hongfeng Li; Shicong Li; Yongtao Jia; Fei Xiang; Jincheng He; Ming Li

arXiv:2106.02934·cs.SD·June 8, 2021

TL;DR

This paper introduces a lightweight dual-channel model, LSTM-Former, for target speaker separation on mobile devices, using a new dual-channel dataset, LibriPhone, to improve performance in real-world scenarios.

Contribution

The paper presents a novel dual-channel dataset, LibriPhone, and a lightweight LSTM-Former model optimized for mobile target speaker separation tasks.

Findings

01. Dual-channel LSTM-Former outperforms single-channel by 25%

02. LibriPhone dataset mimics real-world mobile scenarios

03. Lightweight model suitable for mobile deployment

Abstract

Nowadays, there is a strong need to deploy target speaker separation (TSS) models on mobile devices, where model size and computational complexity are limited. To better perform TSS for mobile voice communication, we first build LibriPhone, a dual-channel dataset based on a specific scenario. Specifically, to better mimic the real-case scenario, instead of simulating from a single-channel dataset, LibriPhone is made by simultaneously replaying pairs of utterances from LibriSpeech through two professional artificial heads and recording them with the two built-in microphones of a mobile phone. Then, we propose a lightweight time-frequency domain separation model, LSTM-Former, which is based on the LSTM framework with a scale-invariant source-to-noise ratio (SI-SNR) loss. For the experiments on LibriPhone, we explore the dual-channel LSTM-Former model and a single-channel version by a random single channel of…
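The abstract names an SI-SNR loss but does not spell it out. As a rough illustration only, the sketch below computes SI-SNR in its commonly used form (zero-mean both signals, project the estimate onto the target, compare the energy of the projection with the energy of the residual) and negates it to obtain a loss. The function names `si_snr` and `si_snr_loss` and the `eps` stabiliser are assumptions made for this example; this is not the authors' implementation, and their exact variant may differ.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length waveforms.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean so that a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target to get the "signal" component.
    dot_et = sum(a * b for a, b in zip(e, t))
    energy_t = sum(b * b for b in t)
    scale = dot_et / (energy_t + eps)
    projection = [scale * b for b in t]

    # Whatever is left after the projection is treated as noise.
    noise = [a - p for a, p in zip(e, projection)]

    energy_proj = sum(p * p for p in projection)
    energy_noise = sum(x * x for x in noise)
    return 10.0 * math.log10((energy_proj + eps) / (energy_noise + eps))


def si_snr_loss(estimate, target):
    """Negative SI-SNR, so that minimising the loss maximises SI-SNR."""
    return -si_snr(estimate, target)


if __name__ == "__main__":
    # Toy signals: a target tone plus a weak or strong interfering tone.
    target = [math.sin(0.1 * i) for i in range(200)]
    interferer = [math.sin(0.37 * i + 1.0) for i in range(200)]
    weak_mix = [s + 0.1 * v for s, v in zip(target, interferer)]
    strong_mix = [s + 0.5 * v for s, v in zip(target, interferer)]
    print("SI-SNR, weak interference:  ", si_snr(weak_mix, target))
    print("SI-SNR, strong interference:", si_snr(strong_mix, target))
```

Because the estimate is projected onto the target before the energies are compared, rescaling the estimate leaves the score essentially unchanged, which is the property the "scale-invariant" name refers to. In a training setup the same arithmetic would be written with a tensor library so that gradients can flow through it.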

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing