Multimodal Integration Challenges in Emotionally Expressive Child Avatars for Training Applications

Pegah Salehi; Sajad Amouei Sheshkal; Vajira Thambawita; Michael A. Riegler; Pål Halvorsen

arXiv:2506.13477·cs.HC·July 9, 2025

TL;DR

This study develops a real-time multimodal avatar system that combines Unreal Engine 5 with NVIDIA Omniverse Audio2Face to generate facial expressions from voice. Its results highlight the importance of audiovisual congruence for emotion recognition.

Contribution

It introduces a novel real-time architecture for emotionally expressive child avatars that integrates voice-driven facial animation with a focus on audiovisual alignment challenges.

Findings

01. Sadness and joy are recognized well; anger is recognized less reliably without audio.

02. Silencing a mismatched voice improves the perceived realism of the avatars.

03. Audiovisual congruence significantly affects emotional perception and avatar realism.

Abstract

Dynamic facial emotion is essential for believable AI-generated avatars, yet most systems remain visually static, limiting their use in simulations like virtual training for investigative interviews with abused children. We present a real-time architecture combining Unreal Engine 5 MetaHuman rendering with NVIDIA Omniverse Audio2Face to generate facial expressions from vocal prosody in photorealistic child avatars. Due to limited TTS options, both avatars were voiced using young adult female models from two systems to better fit character profiles, introducing a voice-age mismatch. This confound may affect audiovisual alignment. We used a two-PC setup to decouple speech generation from GPU-intensive rendering, enabling low-latency interaction in desktop and VR. A between-subjects study (N=70) compared audio+visual vs. visual-only conditions as participants rated emotional clarity,…

Taxonomy

Topics: Emotion and Mood Recognition · Face recognition and analysis · Social Robot Interaction and HRI