Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment

Yiwen Shao; Shi-Xiong Zhang; Yong Xu; Meng Yu; Dong Yu; Daniel Povey; Sanjeev Khudanpur

arXiv:2406.09589 · eess.AS · June 19, 2024 · Interspeech

Open Access

TL;DR

This paper presents Solo-SF, a novel method that improves multi-channel, multi-speaker ASR by using a target speaker's isolated segment, achieving lower error rates without relying on microphone array configurations.

Contribution

Introducing Solo-SF, a new approach that leverages solo speech segments to enhance target speaker recognition in multi-channel ASR, bypassing traditional spatial information requirements.

Findings

01. Solo-SF outperforms existing methods in CER reduction.

02. Effective solo segment selection strategies are crucial for Solo-SF.

03. Demonstrated robustness across datasets and noise conditions.
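
CER in the findings above is the character error rate: the character-level edit distance between the reference transcript and the recognizer output, divided by the reference length. As a minimal sketch of how that metric is commonly computed (the function names `edit_distance` and `cer` are illustrative, and this is not the paper's scoring script):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, a six-character reference with one substituted character in the hypothesis gives a CER of 1/6. Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.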

Abstract

In the field of multi-channel, multi-speaker Automatic Speech Recognition (ASR), the task of discerning and accurately transcribing a target speaker's speech within background noise remains a formidable challenge. Traditional approaches often rely on microphone array configurations and the information of the target speaker's location or voiceprint. This study introduces the Solo Spatial Feature (Solo-SF), an innovative method that utilizes a target speaker's isolated speech segment to enhance ASR performance, thereby circumventing the need for conventional inputs like microphone array layouts. We explore effective strategies for selecting optimal solo segments, a crucial aspect for Solo-SF's success. Through evaluations conducted on the AliMeeting dataset and AISHELL-1 simulations, Solo-SF demonstrates superior performance over existing techniques, significantly lowering Character Error…
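
The abstract does not spell out how Solo-SF is computed. Purely as an illustration of the general idea (taking a spatial cue from the target speaker's solo segment instead of from the array geometry), the sketch below assumes a feature built from inter-channel phase differences (IPD) between one microphone pair. The function names, the circular averaging, and the cosine-similarity form are all assumptions of this sketch and may differ from the paper's actual formulation.

```python
def ipd_phasors(ch_a, ch_b):
    """Unit phasors of the inter-channel phase difference per (frame, bin).

    ch_a, ch_b: lists of frames, each frame a list of complex STFT bins
    from two microphones (hypothetical input layout).
    """
    out = []
    for frame_a, frame_b in zip(ch_a, ch_b):
        row = []
        for a, b in zip(frame_a, frame_b):
            cross = a * b.conjugate()  # phase of cross = phase(a) - phase(b)
            mag = abs(cross)
            row.append(cross / mag if mag > 0 else 0j)
        out.append(row)
    return out


def solo_reference(solo_a, solo_b):
    """Circular mean of the IPD over the solo segment, one phasor per bin."""
    phasors = ipd_phasors(solo_a, solo_b)
    ref = []
    for f in range(len(phasors[0])):
        total = sum(frame[f] for frame in phasors)
        mag = abs(total)
        ref.append(total / mag if mag > 0 else 0j)
    return ref


def similarity_feature(mix_a, mix_b, ref):
    """cos(IPD_mix(t, f) - IPD_solo(f)) for every frame t and bin f.

    Values near 1 mark time-frequency bins whose phase difference matches
    the one observed while the target speaker was talking alone.
    """
    feature = []
    for frame in ipd_phasors(mix_a, mix_b):
        feature.append([(p * r.conjugate()).real for p, r in zip(frame, ref)])
    return feature
```

Under these assumptions, the reference is estimated once from the solo segment and then compared against every frame of the overlapped recording, so no microphone positions or direction-of-arrival estimate enter the computation.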

Taxonomy

Topics: Speech Recognition and Synthesis