TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding

Mingyue Huo; Yiwen Shao; Yuheng Zhang

arXiv:2601.06896·eess.AS·January 13, 2026

TL;DR

TagSpeech is an end-to-end framework for joint multi-speaker ASR and diarization. It uses temporal anchors and fine-grained timestamp prediction to improve accuracy and efficiency on complex overlapping speech.

Contribution

The paper presents an LLM-based approach with decoupled semantic and speaker streams, plus an interleaved time anchor mechanism that aligns speaker identity with spoken content.
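To make the idea of a serialized, time-grounded transcript concrete, here is a minimal sketch of how multi-speaker segments might be flattened into one target string in start-time order, in the spirit of Serialized Output Training. The token layout (`<spk1>`, `<0.00>`), the `Segment` class, and the `serialize` function are all hypothetical illustrations; the listing does not specify TagSpeech's actual output format or how its anchors are interleaved.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # hypothetical speaker label, e.g. "spk1"
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # transcript of the segment


def serialize(segments):
    """Flatten segments into one string, ordered by start time.

    Each segment becomes: <speaker> <start> text <end>.
    This token layout is an assumption made for illustration only.
    """
    ordered = sorted(segments, key=lambda s: s.start)
    parts = [
        f"<{s.speaker}> <{s.start:.2f}> {s.text} <{s.end:.2f}>"
        for s in ordered
    ]
    return " ".join(parts)


# Two overlapping turns: spk2 starts before spk1 has finished.
example = [
    Segment("spk2", 0.8, 1.5, "hi"),
    Segment("spk1", 0.0, 1.2, "hello there"),
]
serialized = serialize(example)
```

Because the ordering depends only on start times, overlapping turns stay in a deterministic sequence, and the time tokens carry the "when" alongside the "who" and "what".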

Findings

01. Achieves lower Diarization Error Rate (DER) than the compared baselines on the AMI and AliMeeting benchmarks.

02. Handles complex overlapping speech more effectively than previous models.

03. Uses a parameter-efficient training paradigm with a frozen LLM backbone.

Abstract

We present TagSpeech, a unified LLM-based framework that utilizes Temporal Anchor Grounding for joint multi-speaker ASR and diarization. The framework is built on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn turn-taking dynamics; and (2) an interleaved time anchor mechanism that not only supports fine-grained timestamp prediction but also acts as a synchronization signal between semantic understanding and speaker tracking. Compared to previous works that primarily focus on speaker-attributed ASR or implicit diarization, TagSpeech addresses the challenge of fine-grained speaker-content alignment and explicitly models "who spoke what and when" in an end-to-end manner. Experiments on AMI and AliMeeting benchmarks demonstrate that our method achieves consistent improvements in Diarization Error Rate (DER) over strong…
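The abstract reports results as Diarization Error Rate. As a reminder of what that metric measures, the sketch below uses the standard definition: missed speech, false-alarm speech, and speaker-confusion time, summed and divided by total reference speech time. The function name and the example durations are illustrative assumptions and are not taken from the paper's results or evaluation setup.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_reference):
    """Standard DER: total error time over total reference speech time.

    All arguments are durations in seconds.
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed + false_alarm + confusion) / total_reference


# Illustrative numbers only: 10 s of errors over 100 s of reference speech.
der = diarization_error_rate(
    missed=3.0, false_alarm=2.0, confusion=5.0, total_reference=100.0
)
```

DER can exceed 1.0 when false alarms are large, because the denominator counts only reference speech.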

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling