Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio

Mohan Shi; Xiong Xiao; Ruchao Fan; Shaoshi Ling; Jinyu Li

arXiv:2511.16046 · eess.AS · November 21, 2025

Open Access

TL;DR

This paper introduces JEDIS-LLM, a Speech-LLM that is trained only on short audio (under 20s) yet performs streamable joint ASR and diarization on long audio zero-shot, without additional training, and outperforms strong baselines.

Contribution

The paper presents a Speech-LLM with a Speaker Prompt Cache (SPC). Although the model is trained solely on short clips, the SPC enables zero-shot, streamable joint ASR and diarization on long audio.

Findings

01. Outperforms strong baselines on short and long audio

02. Enables zero-shot inference on long audio without retraining

03. Achieves state-of-the-art performance in joint ASR and diarization

Abstract

Joint automatic speech recognition (ASR) and speaker diarization aim to answer the question "who spoke what" in multi-speaker scenarios. In this paper, we present an end-to-end speech large language model (Speech-LLM) for Joint strEamable DIarization and aSr (JEDIS-LLM). The model is trained only on short audio under 20s but is capable of streamable inference on long-form audio without additional training. This is achieved by introducing a Speaker Prompt Cache (SPC) with an on-the-fly update mechanism during chunk-wise streaming inference, inspired by the autoregressive nature of LLMs. The SPC also allows the seamless use of pre-enrolled speaker profiles, which are common in many scenarios such as meeting transcription. To further enhance diarization capability, we incorporate word-level speaker supervision into the speech encoder during training. Experimental results demonstrate that our…
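
The abstract gives no implementation details for the Speaker Prompt Cache, so the following is only a toy sketch of the general idea: process long audio chunk by chunk, condition each chunk on cached per-speaker prompts, and update the cache on the fly when a new speaker appears. Everything here is hypothetical and not taken from the paper: the names `SpeakerPromptCache`, `toy_model`, and `stream_transcribe`, the representation of audio as `(voice, word)` pairs, and the exact-match speaker lookup that stands in for the Speech-LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Frame = Tuple[str, str]    # (voice signature, word): a stand-in for real audio
Labeled = Tuple[str, str]  # (speaker label, word)


@dataclass
class SpeakerPromptCache:
    """Hypothetical cache mapping a speaker label to one voice example."""
    prompts: Dict[str, str] = field(default_factory=dict)

    def enroll(self, label: str, voice: str) -> None:
        # Used both for pre-enrolled profiles and for on-the-fly updates.
        # The first example seen for a label is kept.
        self.prompts.setdefault(label, voice)


def toy_model(chunk: List[Frame], prompts: Dict[str, str]) -> List[Labeled]:
    """Stand-in for the Speech-LLM (assumption: exact voice matching)."""
    known = {voice: label for label, voice in prompts.items()}
    out: List[Labeled] = []
    for voice, word in chunk:
        if voice not in known:
            # Unseen voice: allocate the next free generic label.
            known[voice] = f"spk{len(known) + 1}"
        out.append((known[voice], word))
    return out


def stream_transcribe(
    frames: List[Frame],
    chunk_size: int,
    cache: SpeakerPromptCache,
    model: Callable[[List[Frame], Dict[str, str]], List[Labeled]] = toy_model,
) -> List[Labeled]:
    """Chunk-wise inference with a cache that is updated after every chunk."""
    transcript: List[Labeled] = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        labeled = model(chunk, dict(cache.prompts))
        # On-the-fly update: remember every speaker seen in this chunk so
        # that later chunks reuse the same label.
        for (voice, _), (label, _) in zip(chunk, labeled):
            cache.enroll(label, voice)
        transcript.extend(labeled)
    return transcript


if __name__ == "__main__":
    cache = SpeakerPromptCache()
    cache.enroll("alice", "vA")  # pre-enrolled speaker profile
    audio = [("vA", "hi"), ("vB", "hello"), ("vA", "how"),
             ("vA", "are"), ("vB", "fine"), ("vC", "bye")]
    for label, word in stream_transcribe(audio, chunk_size=2, cache=cache):
        print(label, word)
```

In this toy setting the pre-enrolled speaker keeps the label `alice` in every chunk, while unseen speakers receive generic labels that stay consistent across later chunks because the cache carries them forward.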

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Topic Modeling