Distilling LLM Semantic Priors into Encoder-Only Multi-Talker ASR with Talker-Count Routing

Hao Shi; Yusuke Fujita; Roman Koshkin; Mengjie Zhao; Yuan Gao; Lianbo Liu; Yui Sudo

arXiv:2603.10587·cs.SD·March 12, 2026

TL;DR

This paper introduces an encoder-only multi-talker speech recognition model that distills semantic priors from large language models, enabling efficient decoding and improved performance in multi-talker scenarios.

Contribution

It presents a novel encoder-only framework that incorporates LLM-derived semantic guidance and a talker-count routing mechanism for flexible multi-talker ASR.

Findings

01 Achieves performance comparable to LLM-based systems in two-talker scenarios.

02 Significantly improves three-talker recognition accuracy.

03 Maintains a low real-time factor for efficient inference.

Abstract

Large language models (LLMs) provide strong semantic priors that can improve multi-talker automatic speech recognition (MT-ASR), but using an LLM as an autoregressive decoder is computationally expensive and remains fragile under heavy overlap. In this paper, we propose an encoder-only MT-ASR framework that adapts an LLM to multi-talker conditioning and distills its semantic guidance into the encoder during training, while retaining fast CTC-style decoding at inference. Our model employs a post-encoder separator with serialized CTC to produce talker-ordered transcripts, and leverages an adapted LLM-based SOT objective as a multi-talker-aware teacher signal to explicitly regularize mixed-speech representations. To further support variable numbers of talkers, we introduce a Talker-Count Head that predicts the talker count and dynamically selects the appropriate decoding branch.…
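The inference path described in the abstract (predict the talker count, select the matching decoding branch, then decode each talker stream CTC-style) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `branches` mapping, the blank index, and the toy label sequences are all hypothetical, and the separator, serialized CTC ordering, and LLM distillation used in training are not reproduced here.

```python
from typing import Callable, Dict, List, Sequence, Tuple

BLANK = 0  # assumed CTC blank index (hypothetical choice for this sketch)

# A branch maps encoder features to one frame-level label sequence per talker.
Branch = Callable[[object], List[List[int]]]


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Standard greedy CTC post-processing: merge repeats, then drop blanks."""
    out: List[int] = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def route_by_talker_count(
    count_logits: Sequence[float],
    branches: Dict[int, Branch],
    features: object,
) -> Tuple[int, List[List[int]]]:
    """Pick the branch for the predicted talker count and decode each stream.

    Assumption: count_logits[i] scores the hypothesis "i + 1 talkers".
    """
    n_talkers = argmax(count_logits) + 1
    per_talker_frames = branches[n_talkers](features)
    return n_talkers, [ctc_greedy_collapse(s) for s in per_talker_frames]


# Toy stand-ins for the decoding branches (hypothetical frame-level outputs).
def one_talker_branch(features: object) -> List[List[int]]:
    return [[0, 7, 7, 0, 0]]


def two_talker_branch(features: object) -> List[List[int]]:
    return [[0, 3, 3, 0, 4], [5, 0, 5, 5, 0]]


branches = {1: one_talker_branch, 2: two_talker_branch}
n, transcripts = route_by_talker_count([0.1, 2.3], branches, features=None)
```

Because routing happens once per utterance and each branch decodes without autoregression, a scheme like this keeps inference cost close to that of a plain CTC encoder, which is the efficiency argument the abstract makes.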

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders