META-CAT: Speaker-Informed Speech Embeddings via Meta Information Concatenation for Multi-talker ASR
Jinhan Wang, Weiqing Wang, Kunal Dhawan, Taejin Park, Myungjong Kim, Ivan Medennikov, He Huang, Nithin Koluguri, Jagadeesh Balam, Boris Ginsburg

TL;DR
This paper introduces META-CAT, an end-to-end multi-talker ASR framework that concatenates speaker meta-information into the ASR encoder activations to support both multi-speaker and target-speaker recognition, without neural mask estimation or masking at the audio or feature level.
Contribution
The paper presents a new Meta-Cat method for masking encoder activations using speaker supervision, enabling effective multi-talker and target-speaker ASR in a unified, end-to-end architecture.
Findings
Achieves competitive performance in MS-ASR and TS-ASR tasks.
Eliminates the need for neural mask estimation or masking at the audio or feature level.
Gives a preliminary demonstration of a unified dual-task model that handles both MS-ASR and TS-ASR.
Abstract
We propose a novel end-to-end multi-talker automatic speech recognition (ASR) framework that enables both multi-speaker (MS) ASR and target-speaker (TS) ASR. Our proposed model is trained in a fully end-to-end manner, incorporating speaker supervision from a pre-trained speaker diarization module. We introduce an intuitive yet effective method for masking ASR encoder activations using output from the speaker supervision module, a technique we term Meta-Cat (meta-information concatenation), that can be applied to both MS-ASR and TS-ASR. Our results demonstrate that the proposed architecture achieves competitive performance in both MS-ASR and TS-ASR tasks, without the need for traditional methods, such as neural mask estimation or masking at the audio or feature level. Furthermore, we demonstrate a glimpse of a unified dual-task model which can efficiently handle both MS-ASR and TS-ASR…
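The masking-and-concatenation idea named in the abstract can be illustrated with a minimal sketch. It assumes, since the abstract does not give the details, that each frame of encoder activations is scaled by each speaker's activity probability and that the per-speaker copies are concatenated along the feature axis. The function name `meta_cat`, the argument names, and the toy values are hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def meta_cat(encoder_out: Matrix, speaker_probs: Matrix) -> Matrix:
    """Mask encoder frames per speaker and concatenate along the feature axis.

    encoder_out:   T frames x D features (stand-in for ASR encoder activations)
    speaker_probs: T frames x K speakers (stand-in for diarization outputs in [0, 1])
    returns:       T frames x (K * D) features

    Assumption: masking is a per-frame scalar multiplication; the paper's
    exact operation may differ.
    """
    if len(encoder_out) != len(speaker_probs):
        raise ValueError("encoder_out and speaker_probs need the same number of frames")

    fused: Matrix = []
    for frame, probs in zip(encoder_out, speaker_probs):
        row: List[float] = []
        for p in probs:
            # One masked copy of the frame per speaker, appended side by side.
            row.extend(p * x for x in frame)
        fused.append(row)
    return fused


# Toy input: 2 frames, 2 features, 2 speakers.
encoder_out = [[1.0, 2.0], [3.0, 4.0]]
speaker_probs = [[1.0, 0.0], [0.5, 0.5]]
fused = meta_cat(encoder_out, speaker_probs)
print(fused)  # [[1.0, 2.0, 0.0, 0.0], [1.5, 2.0, 1.5, 2.0]]
```

In the first frame only speaker 0 is active, so the second half of the fused vector is zero. In the second frame both speakers overlap, and each half carries a down-weighted copy of the same activations.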
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
