A Hierarchical End-of-Turn Model with Primary Speaker Segmentation for Real-Time Conversational AI

Karim Helwani, Hoang Do, James Luan, and Sriram Srinivasan

arXiv:2603.13379 · cs.LG · March 17, 2026

TL;DR

This paper introduces a real-time hierarchical model for conversational AI that accurately detects turn boundaries and primary speakers in multi-speaker environments, enabling more natural interactions with low latency and computational efficiency.

Contribution

The work presents a novel hierarchical, causal EOT model combined with primary speaker segmentation, optimized for real-time edge deployment in multi-speaker conversational AI systems.

Findings

1. Achieves 82% multi-class frame-level F1 in speaker segmentation.

2. Reaches 87.7% recall on turn detection with 36 ms latency.

3. Uses only 1.14 million parameters while outperforming transformer baselines.
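
As a reading aid for the frame-level F1 figure in the first finding, the sketch below shows one common way to score multi-class frame labels. It assumes macro averaging (an unweighted mean of per-class F1); the summary does not say which averaging the authors use, and the class names `primary`, `other`, and `silence` are hypothetical placeholders.

```python
from typing import Sequence


def macro_f1(reference: Sequence[str], predicted: Sequence[str]) -> float:
    """Macro-averaged F1 over per-frame class labels.

    Assumption: macro averaging. The paper's exact averaging scheme
    is not stated in this summary.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    classes = sorted(set(reference) | set(predicted))
    if not classes:
        raise ValueError("at least one frame is required")

    scores = []
    for c in classes:
        pairs = list(zip(reference, predicted))
        tp = sum(1 for r, p in pairs if r == c and p == c)
        fp = sum(1 for r, p in pairs if r != c and p == c)
        fn = sum(1 for r, p in pairs if r == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty class.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical per-frame labels, one entry per frame.
reference = ["primary", "primary", "other", "silence", "silence", "silence"]
predicted = ["primary", "other", "other", "silence", "silence", "primary"]
score = macro_f1(reference, predicted)
```

Each frame counts once, so long silences or long turns dominate the per-class counts in proportion to their duration.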

Abstract

We present a real-time front-end for voice-based conversational AI to enable natural turn-taking in two-speaker scenarios by combining primary speaker segmentation with hierarchical End-of-Turn (EOT) detection. To operate robustly in multi-speaker environments, the system continuously identifies and tracks the primary user, ensuring that downstream EOT decisions are not confounded by background conversations. The tracked activity segments are fed to a hierarchical, causal EOT model that predicts the immediate conversational state by independently analyzing per-speaker speech features from both the primary speaker and the bot. Simultaneously, the model anticipates near-future states ($t + 10/20/30$ ms) through probabilistic predictions that are aware of the conversation partner's speech. Task-specific knowledge distillation compresses wav2vec 2.0 representations (768-D) into a compact…
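
To make the near-future prediction targets in the abstract concrete, the following sketch pairs each frame's current state with the states at the three look-ahead offsets. It is an illustration under stated assumptions, not the authors' pipeline: the 10 ms frame hop, the state names (`speech`, `pause`, `eot`), and the use of `None` for offsets past the end of the sequence are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence

FRAME_HOP_MS = 10            # assumed frame hop; not given in the abstract
HORIZONS_MS = (10, 20, 30)   # look-ahead offsets named in the abstract


def future_state_targets(
    states: Sequence[str],
    hop_ms: int = FRAME_HOP_MS,
    horizons_ms: Sequence[int] = HORIZONS_MS,
) -> List[Dict[int, Optional[str]]]:
    """For each frame t, map offset (ms) -> state label at t + offset.

    Offset 0 is the current state. Offsets that run past the end of
    the sequence map to None (hypothetical convention).
    """
    targets = []
    for t, current in enumerate(states):
        row: Dict[int, Optional[str]] = {0: current}
        for h in horizons_ms:
            if h % hop_ms:
                raise ValueError("horizon must be a multiple of the frame hop")
            idx = t + h // hop_ms
            row[h] = states[idx] if idx < len(states) else None
        targets.append(row)
    return targets


# Hypothetical frame-level conversational states.
states = ["speech", "speech", "pause", "eot", "eot"]
targets = future_state_targets(states)
```

Targets built this way only look ahead in the labels; a causal model would still see inputs up to frame `t` alone when predicting them.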

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing