A Hierarchical End-of-Turn Model with Primary Speaker Segmentation for Real-Time Conversational AI
Karim Helwani, Hoang Do, James Luan, and Sriram Srinivasan

TL;DR
This paper introduces a real-time hierarchical model for conversational AI that accurately detects turn boundaries and primary speakers in multi-speaker environments, enabling more natural interactions with low latency and computational efficiency.
Contribution
The work presents a novel hierarchical, causal End-of-Turn (EOT) model combined with primary speaker segmentation, optimized for real-time edge deployment in multi-speaker conversational AI systems.
Findings
Achieves 82% multi-class frame-level F1 in speaker segmentation.
Reaches 87.7% recall on turn detection with 36 ms latency.
Reduces model size to 1.14 million parameters, outperforming transformer baselines.
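For readers unfamiliar with the segmentation metric, the following is a minimal sketch of one common way to compute a multi-class frame-level F1: score each class over aligned per-frame labels, then take the unweighted (macro) average. The averaging scheme, the three-class label set, and the example sequences are assumptions made for illustration; this summary does not state how the paper defines its classes or averages the score.

```python
from collections import Counter


def per_class_f1(reference, predicted):
    """Return {label: F1} computed over aligned frame-level labels."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same number of frames")
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, hyp in zip(reference, predicted):
        if ref == hyp:
            tp[ref] += 1
        else:
            fp[hyp] += 1  # the predicted class gains a false positive
            fn[ref] += 1  # the reference class gains a false negative
    scores = {}
    for label in sorted(set(reference) | set(predicted)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(reference, predicted):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = per_class_f1(reference, predicted)
    return sum(scores.values()) / len(scores)


# Hypothetical labels, one per frame:
# 0 = no speech, 1 = primary speaker, 2 = other speaker.
reference = [0, 0, 1, 1, 1, 1, 2, 2, 0, 0]
predicted = [0, 1, 1, 1, 1, 0, 2, 2, 2, 0]

class_scores = per_class_f1(reference, predicted)
overall = macro_f1(reference, predicted)
```

Because every frame counts equally within a class and every class counts equally in the average, a rarely active class such as a background speaker can pull the score down as much as the dominant class.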
Abstract
We present a real-time front-end for voice-based conversational AI to enable natural turn-taking in two-speaker scenarios by combining primary speaker segmentation with hierarchical End-of-Turn (EOT) detection. To operate robustly in multi-speaker environments, the system continuously identifies and tracks the primary user, ensuring that downstream EOT decisions are not confounded by background conversations. The tracked activity segments are fed to a hierarchical, causal EOT model that predicts the immediate conversational state by independently analyzing per-speaker speech features from both the primary speaker and the bot. Simultaneously, the model anticipates near-future states (… ms) through probabilistic predictions that are aware of the conversation partner's speech. Task-specific knowledge distillation compresses wav2vec 2.0 representations (768-D) into a compact…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
