Gaze-Enhanced Multimodal Turn-Taking Prediction in Triadic Conversations
Seongsil Heo, Calvin Murdock, Michael Proulx, Christi Miller

TL;DR
This paper presents a lightweight, privacy-conscious framework that integrates gaze and speaker localization to improve turn-taking prediction in triadic conversations, enhancing speech intelligibility in noisy environments.
Contribution
It introduces a novel method combining gaze and spatial cues for turn-taking prediction without heavy computation, advancing multimodal interaction modeling.
Findings
Gaze data from a single user improves prediction accuracy.
Multi-user gaze data further enhances prediction performance.
The approach supports adaptive sound control in noisy environments.
Abstract
Turn-taking prediction is crucial for seamless interactions. This study introduces a novel, lightweight framework for accurate turn-taking prediction in triadic conversations without relying on computationally intensive methods. Unlike prior approaches that either disregard gaze or treat it as a passive signal, our model integrates gaze with speaker localization, structuring it within a spatial constraint to transform it into a reliable predictive cue. Leveraging egocentric behavioral cues, our experiments demonstrate that incorporating gaze data from a single user significantly improves prediction performance, while gaze data from multiple users further enhances it by capturing richer conversational dynamics. This study presents a lightweight and privacy-conscious approach to support adaptive, directional sound control, enhancing speech intelligibility in noisy environments,…
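The abstract does not specify how gaze is structured within a spatial constraint. As a rough illustration only, the sketch below assumes that gaze and speaker positions are both reduced to azimuth angles, and that a gaze sample counts as directed at a speaker only when it falls within a fixed angular window around that speaker's localized direction. The function names, the azimuth representation, and the 15-degree threshold are all hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = (a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def gaze_target(
    gaze_azimuth_deg: float,
    speaker_azimuths_deg: Sequence[float],
    max_offset_deg: float = 15.0,  # hypothetical threshold, not from the paper
) -> Optional[int]:
    """Return the index of the speaker the gaze is directed at, or None.

    Assumption: the spatial constraint is a fixed angular window around each
    localized speaker direction. Gaze outside every window is treated as
    uninformative rather than forced onto the nearest speaker.
    """
    best_index: Optional[int] = None
    best_offset = max_offset_deg
    for index, azimuth in enumerate(speaker_azimuths_deg):
        offset = angular_difference(gaze_azimuth_deg, azimuth)
        if offset <= best_offset:
            best_index, best_offset = index, offset
    return best_index


if __name__ == "__main__":
    # Two conversation partners localized at -30 and +30 degrees (made-up values).
    partners = [-30.0, 30.0]
    print(gaze_target(28.0, partners))  # gaze near the second partner
    print(gaze_target(0.0, partners))   # gaze between partners, outside both windows
```

In a setup like this, the resulting per-frame target index (or its absence) could be fed to a turn-taking predictor as a discrete feature alongside voice activity. Whether the authors use a hard window, a soft weighting, or something else entirely is not stated in the abstract.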
