TL;DR
This paper investigates how predictive turn-taking models perform in noisy environments, demonstrating that multimodal models incorporating visual cues are more robust than audio-only models, though their effectiveness varies with noise type.
Contribution
The study introduces a multimodal PTTM that leverages visual cues to improve robustness in noisy conditions, and provides insights into training challenges with noisy data and transcriptions.
Findings
Trained with noisy data, the multimodal PTTM achieves 72% accuracy in 10 dB music noise.
Audio-only PTTM hold/shift accuracy drops from 84% in clean speech to 52% in 10 dB music noise.
Training effectiveness depends on accurate transcriptions, limiting ASR use in noisy environments.
Abstract
Accurate predictive turn-taking models (PTTMs) are essential for naturalistic human-robot interaction. However, little is known about their performance in noise. This study therefore explores PTTM performance in types of noise likely to be encountered once deployed. Our analyses reveal PTTMs are highly sensitive to noise. Hold/shift accuracy drops from 84% in clean speech to just 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM, which includes visual features, to better exploit visual cues, reaching 72% accuracy in 10 dB music noise. The multimodal PTTM outperforms the audio-only PTTM across all noise types and SNRs, highlighting its ability to exploit visual cues; however, this does not always generalise to new types of noise. Analysis also reveals that successful training relies on accurate transcription, limiting the use of ASR-derived transcriptions to clean…
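The abstract reports results at fixed signal-to-noise ratios such as 10 dB. This summary does not describe how the noisy test or training data were built, so the following is only a generic sketch of mixing noise into speech at a target SNR, not the authors' procedure. The function names `mix_at_snr` and `measured_snr_db` are hypothetical, and white Gaussian noise and a sine tone stand in for the music noise and speech.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise is silent; SNR is undefined")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power we need.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture, given the clean speech it was built from."""
    residual = [m - s for s, m in zip(speech, mixture)]
    speech_power = sum(s * s for s in speech) / len(speech)
    residual_power = sum(r * r for r in residual) / len(residual)
    return 10 * math.log10(speech_power / residual_power)


# Illustrative stand-ins (assumptions): a 220 Hz tone as "speech", Gaussian noise as "music".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Lower `snr_db` values make the added noise louder relative to the speech, so a 10 dB condition is a harder test than, say, a 20 dB one.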
