Multimodal Integration Challenges in Emotionally Expressive Child Avatars for Training Applications
Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen

TL;DR
This study develops a real-time multimodal avatar system combining Unreal Engine 5 and NVIDIA Omniverse Audio2Face to generate facial expressions from voice, highlighting the importance of audiovisual congruence in emotional recognition.
Contribution
It introduces a novel real-time architecture for emotionally expressive child avatars that integrates voice-driven facial animation with a focus on audiovisual alignment challenges.
Findings
Sadness and joy are recognized well, whereas anger is recognized less reliably when audio is absent.
Silencing a mismatched voice improves the avatars' perceived realism.
Audiovisual congruence significantly impacts emotional perception and avatar realism.
Abstract
Dynamic facial emotion is essential for believable AI-generated avatars, yet most systems remain visually static, limiting their use in simulations like virtual training for investigative interviews with abused children. We present a real-time architecture combining Unreal Engine 5 MetaHuman rendering with NVIDIA Omniverse Audio2Face to generate facial expressions from vocal prosody in photorealistic child avatars. Due to limited TTS options, both avatars were voiced using young adult female models from two systems to better fit character profiles, introducing a voice-age mismatch. This confound may affect audiovisual alignment. We used a two-PC setup to decouple speech generation from GPU-intensive rendering, enabling low-latency interaction in desktop and VR. A between-subjects study (N=70) compared audio+visual vs. visual-only conditions as participants rated emotional clarity,…
Taxonomy
Topics: Emotion and Mood Recognition · Face recognition and analysis · Social Robot Interaction and HRI
