G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition

Jing Peng; Ziyi Chen; Haoyu Li; Yucheng Wang; Duo Ma; Mengtian Li; Yunfan Du; Dezhu Xu; Kai Yu; Shuai Wang

arXiv:2603.10468·eess.AS·March 12, 2026

TL;DR

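G-STAR is an end-to-end system that couples a time-aware speaker tracker with a Speech-LLM transcription backbone to produce time-stamped, speaker-attributed transcripts for long-form, multi-party meetings. It targets two weaknesses of earlier Speech-LLM systems: coarse temporal boundaries and unreliable speaker identity linking across chunks.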

Contribution

The paper introduces G-STAR, an integrated framework in which a speaker tracker supplies temporally grounded speaker cues and a Speech-LLM generates attributed text conditioned on them. The components can be optimized separately or trained jointly end to end.

Findings

01 Effective speaker tracking with temporal grounding.

02 Improved speaker attribution accuracy over baselines.

03 Flexible training under diverse supervision.

Abstract

We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritize either local diarization or global labeling, but often lack the ability to capture fine-grained temporal boundaries or robust cross-chunk identity linking. We propose G-STAR, an end-to-end system that couples a time-aware speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports both component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Experiments analyze cue fusion, local…
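
The cross-chunk identity problem described in the abstract can be illustrated with a small sketch. This is not G-STAR's tracker, which the abstract does not specify in detail. It is a generic greedy linker that assumes each chunk-local segment comes with a speaker embedding and maps it to a meeting-level label by cosine similarity. All names here (`GlobalSpeakerRegistry`, `link_chunks`, the `threshold` value, the segment dictionary layout) are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class GlobalSpeakerRegistry:
    """Hypothetical registry of meeting-level speakers (not from the paper)."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold  # assumed similarity cutoff
        self.centroids = []         # one running-mean embedding per speaker
        self.counts = []            # segments merged into each centroid

    def assign(self, embedding):
        """Return a global speaker index, creating a new speaker if needed."""
        best_id, best_sim = None, -1.0
        for gid, centroid in enumerate(self.centroids):
            sim = cosine(embedding, centroid)
            if sim > best_sim:
                best_id, best_sim = gid, sim
        if best_id is not None and best_sim >= self.threshold:
            n = self.counts[best_id]
            old = self.centroids[best_id]
            # Update the running mean so the centroid tracks the speaker.
            self.centroids[best_id] = [
                (c * n + e) / (n + 1) for c, e in zip(old, embedding)
            ]
            self.counts[best_id] = n + 1
            return best_id
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1


def link_chunks(chunks, threshold=0.7):
    """Turn chunk-local segments into (start, end, speaker, text) tuples
    on the meeting timeline with consistent global speaker labels."""
    registry = GlobalSpeakerRegistry(threshold)
    transcript = []
    for chunk in chunks:
        offset = chunk["offset"]  # chunk start time within the meeting (s)
        for seg in chunk["segments"]:
            gid = registry.assign(seg["embedding"])
            transcript.append(
                (offset + seg["start"], offset + seg["end"], f"spk{gid}", seg["text"])
            )
    return sorted(transcript)


# Toy input: two chunks, two speakers, 2-D embeddings for readability.
chunks = [
    {"offset": 0.0, "segments": [
        {"start": 0.0, "end": 2.0, "embedding": [1.0, 0.0], "text": "hello"},
        {"start": 1.5, "end": 3.0, "embedding": [0.0, 1.0], "text": "hi there"},
    ]},
    {"offset": 30.0, "segments": [
        {"start": 0.5, "end": 2.0, "embedding": [0.1, 0.9], "text": "how are you"},
        {"start": 2.0, "end": 4.0, "embedding": [0.9, 0.1], "text": "fine"},
    ]},
]

for start, end, speaker, text in link_chunks(chunks):
    print(f"[{start:.1f}-{end:.1f}] {speaker}: {text}")
```

In this toy setup the first two segments overlap in time (1.5 s to 2.0 s), and the second chunk's speakers appear in the opposite order, so a purely chunk-local labeling would swap identities. A greedy threshold rule like this one is order-dependent and brittle under domain shift, which is one reason a learned tracker coupled to the transcription model, as the abstract proposes, may be preferable.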

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems