META-CAT: Speaker-Informed Speech Embeddings via Meta Information Concatenation for Multi-talker ASR

Jinhan Wang; Weiqing Wang; Kunal Dhawan; Taejin Park; Myungjong Kim; Ivan Medennikov; He Huang; Nithin Koluguri; Jagadeesh Balam; Boris Ginsburg

arXiv:2409.12352 · eess.AS · September 20, 2024

TL;DR

This paper introduces META-CAT, an end-to-end multi-talker ASR framework that concatenates speaker information with ASR encoder activations to support both multi-speaker and target-speaker recognition, without neural mask estimation or masking at the audio or feature level.

Contribution

The paper presents a new Meta-Cat method for masking encoder activations using speaker supervision, enabling effective multi-talker and target-speaker ASR in a unified, end-to-end architecture.

Findings

01. Achieves competitive performance in MS-ASR and TS-ASR tasks.

02. Eliminates the need for neural mask estimation or masking at the audio/feature level.

03. Demonstrates a unified dual-task model for multi-talker ASR.

Abstract

We propose a novel end-to-end multi-talker automatic speech recognition (ASR) framework that enables both multi-speaker (MS) ASR and target-speaker (TS) ASR. Our proposed model is trained in a fully end-to-end manner, incorporating speaker supervision from a pre-trained speaker diarization module. We introduce an intuitive yet effective method for masking ASR encoder activations using output from the speaker supervision module, a technique we term Meta-Cat (meta-information concatenation), that can be applied to both MS-ASR and TS-ASR. Our results demonstrate that the proposed architecture achieves competitive performance in both MS-ASR and TS-ASR tasks, without the need for traditional methods, such as neural mask estimation or masking at the audio or feature level. Furthermore, we demonstrate a glimpse of a unified dual-task model which can efficiently handle both MS-ASR and TS-ASR…
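
The abstract describes masking ASR encoder activations with the output of a speaker supervision module and concatenating the results. The following is a minimal sketch of one way such an operation could look, not the authors' implementation. It assumes per-frame encoder activations and per-frame speaker activity probabilities from a diarization module; the function name `meta_cat`, the list-of-lists data layout, and the choice to concatenate along the feature axis are all illustrative assumptions that the abstract does not confirm.

```python
from typing import List

Matrix = List[List[float]]  # rows are time frames


def meta_cat(activations: Matrix, speaker_probs: Matrix) -> Matrix:
    """Hypothetical sketch of speaker-informed masking plus concatenation.

    activations:   T frames x D features (assumed ASR encoder output)
    speaker_probs: T frames x S speakers (assumed diarization output in [0, 1])
    returns:       T frames x (S * D) features
    """
    if len(activations) != len(speaker_probs):
        raise ValueError("activations and speaker_probs need the same number of frames")

    fused: Matrix = []
    for frame, probs in zip(activations, speaker_probs):
        row: List[float] = []
        for p in probs:
            # Scale the whole frame by this speaker's activity, then append it,
            # so each speaker gets its own masked copy of the activations.
            row.extend(p * a for a in frame)
        fused.append(row)
    return fused


# Toy input: 2 frames, 2 features, 2 speakers (values are made up).
activations = [[1.0, 2.0], [3.0, 4.0]]
speaker_probs = [[1.0, 0.0], [0.5, 0.5]]
fused = meta_cat(activations, speaker_probs)
print(fused)
```

In this sketch, no separate mask-estimation network is involved: the speaker probabilities act directly as the mask, which matches the abstract's claim only at a high level. How the real model selects a target speaker for TS-ASR is not specified in the abstract and is left out here.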

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing