Distilling LLM Semantic Priors into Encoder-Only Multi-Talker ASR with Talker-Count Routing
Hao Shi, Yusuke Fujita, Roman Koshkin, Mengjie Zhao, Yuan Gao, Lianbo Liu, Yui Sudo

TL;DR
This paper introduces an encoder-only multi-talker speech recognition model that distills semantic priors from large language models, enabling efficient decoding and improved performance in multi-talker scenarios.
Contribution
It presents a novel encoder-only framework that incorporates LLM-derived semantic guidance and a talker-count routing mechanism for flexible multi-talker ASR.
Findings
Achieves performance comparable to LLM-based systems in two-talker scenarios.
Significantly improves three-talker recognition accuracy.
Maintains a low real-time factor (RTF), keeping inference efficient.
Abstract
Large language models (LLMs) provide strong semantic priors that can improve multi-talker automatic speech recognition (MT-ASR), but using an LLM as an autoregressive decoder is computationally expensive and remains fragile under heavy overlap. In this paper, we propose an encoder-only MT-ASR framework that adapts an LLM to multi-talker conditioning and distills its semantic guidance into the encoder during training, while retaining fast CTC-style decoding at inference. Our model employs a post-encoder separator with serialized CTC to produce talker-ordered transcripts, and leverages an adapted LLM-based SOT objective as a multi-talker-aware teacher signal to explicitly regularize mixed-speech representations. To further support variable numbers of talkers, we introduce a Talker-Count Head that predicts the talker count and dynamically selects the appropriate decoding branch.…
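The inference path described in the abstract (predict the talker count, pick the matching decoding branch, then decode each talker stream CTC-style) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ctc_greedy_decode`, `make_branch`, and `route_by_talker_count`, the blank index 0, the limit of three talkers, and the toy per-frame token IDs are all assumptions made for illustration. The separator and encoder are replaced by pre-computed per-talker frame labels.

```python
from typing import Callable, Dict, List, Sequence, Tuple

BLANK = 0  # assumed CTC blank index (hypothetical choice)

Streams = Sequence[Sequence[int]]  # one frame-level token-ID sequence per talker


def ctc_greedy_decode(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Standard greedy CTC collapse: merge repeated labels, then drop blanks."""
    out: List[int] = []
    prev = None
    for token in frame_ids:
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return out


def make_branch(num_talkers: int) -> Callable[[Streams], List[List[int]]]:
    """Hypothetical stand-in for a decoding branch that handles `num_talkers` streams."""
    def branch(streams: Streams) -> List[List[int]]:
        return [ctc_greedy_decode(s) for s in streams[:num_talkers]]
    return branch


def route_by_talker_count(
    count_logits: Sequence[float],
    branches: Dict[int, Callable[[Streams], List[List[int]]]],
    streams: Streams,
) -> Tuple[int, List[List[int]]]:
    """Pick the branch for the most likely talker count.

    Assumes count_logits[i] scores the hypothesis "i + 1 talkers".
    """
    predicted = max(range(len(count_logits)), key=lambda i: count_logits[i]) + 1
    return predicted, branches[predicted](streams)


# Toy example: the count head favours two talkers, so only two streams are decoded.
branches = {n: make_branch(n) for n in (1, 2, 3)}
streams = [[0, 5, 5, 0, 7], [3, 3, 0, 3, 0], [0, 0, 0, 0, 0]]
count, hyps = route_by_talker_count([0.1, 2.0, -1.0], branches, streams)
```

Because each stream is decoded with a single non-autoregressive pass, the cost of this step grows with the number of frames rather than with the number of generated tokens, which is the property the abstract refers to as fast CTC-style decoding.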
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
