Real-time Generation of Various Types of Nodding for Avatar Attentive Listening System
Kazushi Kato, Koji Inoue, Divesh Lala, Keiko Ochi, and Tatsuya Kawahara

TL;DR
This paper presents a real-time model for generating various types of nodding in avatar attentive listening systems, improving naturalness and synchronization over conventional methods.
Contribution
It extends the voice activity projection model to predict multiple nodding types in real time using multi-task learning and dialogue pretraining.
Findings
Multi-task learning improves nodding prediction accuracy.
Real-time operation achieved with minimal accuracy loss.
Outperforms conventional synchronized nodding methods.
Abstract
In human dialogue, nonverbal information such as nodding and facial expressions is as crucial as verbal information, and spoken dialogue systems are also expected to express such nonverbal behaviors. We focus on nodding, which is critical in an attentive listening system, and propose a model that predicts both its timing and type in real time. The proposed model builds on the voice activity projection (VAP) model, which predicts voice activity from both listener and speaker audio. Unlike conventional models, our extension predicts various types of nodding continuously and in real time. In addition, the proposed model incorporates multi-task learning with verbal backchannel prediction and pretraining on general dialogue data. In the timing and type prediction task, multi-task learning yielded a significant improvement. We confirmed that reducing the…
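The multi-task idea in the abstract (nodding prediction trained jointly with verbal backchannel prediction) can be illustrated with a small frame-wise loss. This is only a minimal sketch under stated assumptions, not the authors' implementation: the class names in `NOD_CLASSES`, the function `multitask_loss`, the weight `bc_weight`, and the choice of cross-entropy plus binary cross-entropy are all hypothetical, since this summary does not specify the label set or the loss.

```python
import math

# Hypothetical label set; the paper's actual nod categories are not given in this summary.
NOD_CLASSES = ["none", "nod_type_a", "nod_type_b"]


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one frame, using log-sum-exp for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def binary_cross_entropy(logit, target):
    """Numerically stable sigmoid cross-entropy for one frame (target is 0 or 1)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def multitask_loss(nod_logits, nod_targets, bc_logits, bc_targets, bc_weight=0.5):
    """Average nod-type loss plus a weighted auxiliary backchannel loss.

    nod_logits: one list of len(NOD_CLASSES) scores per frame
    nod_targets: one class index per frame
    bc_logits: one score per frame for "verbal backchannel occurs"
    bc_targets: one 0/1 label per frame
    bc_weight: assumed weighting of the auxiliary task (hypothetical value)
    """
    n = len(nod_targets)
    nod_loss = sum(cross_entropy(l, t) for l, t in zip(nod_logits, nod_targets)) / n
    bc_loss = sum(binary_cross_entropy(l, t) for l, t in zip(bc_logits, bc_targets)) / n
    return nod_loss + bc_weight * bc_loss
```

In this sketch the backchannel term acts purely as an auxiliary training signal: at inference time only the nod logits would be read out frame by frame, which is one plausible way to keep prediction continuous.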
Taxonomy
Topics: Emotion and Mood Recognition · Social Robot Interaction and HRI · Face recognition and analysis
