SentiAvatar: Towards Expressive and Interactive Digital Humans
Chuhao Jin, Rui Zhang, Qingzhe Gao, Haoyu Shi, Dayu Wu, Yichen Jiang, Yihan Wu, Ruihua Song

TL;DR
SentiAvatar is a framework for creating expressive, interactive 3D digital humans that synchronize speech, gestures, and emotions in real time, leveraging large-scale multimodal data and a novel motion generation architecture.
Contribution
It introduces a new multimodal dialogue dataset, a pre-trained motion foundation model, and an audio-aware motion generation architecture for realistic digital humans.
Findings
Achieved state-of-the-art results on SuSuInterActs and BEATv2 datasets.
Generated 6 seconds of motion in 0.3 seconds while supporting multi-turn streaming.
Produced highly synchronized speech, gestures, and expressions in real time.
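The latency figure above can be read as a real-time factor: seconds of motion produced per second of compute. The short sketch below works through that arithmetic using only the two numbers reported in the findings (6 seconds of motion, 0.3 seconds of generation time). The function name `real_time_factor` is hypothetical and chosen for illustration; this is not the authors' benchmarking code, and the hardware behind the reported timing is not stated here.

```python
def real_time_factor(motion_seconds: float, compute_seconds: float) -> float:
    """Return seconds of motion produced per second of compute.

    Hypothetical helper for illustration only; values above 1.0 mean
    motion is generated faster than it is played back.
    """
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return motion_seconds / compute_seconds


# Figures taken from the findings: 6 s of motion generated in 0.3 s.
rtf = real_time_factor(6.0, 0.3)
print(f"real-time factor: {rtf:.0f}x")
```

Under these numbers the factor is 20, meaning each generated chunk takes one twentieth of its playback duration to compute. That margin is what makes streaming plausible, since the next chunk can be produced well before the current one finishes playing.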
Abstract
We present SentiAvatar, a framework for building expressive interactive 3D digital humans, and use it to create SuSu, a virtual character that speaks, gestures, and emotes in real time. Achieving such a system remains challenging, as it requires jointly addressing three key problems: the lack of large-scale, high-quality multimodal data, the difficulty of robust semantic-to-motion mapping, and the need for fine-grained, frame-level motion-prosody synchronization. To solve these problems, we first build SuSuInterActs (21K clips, 37 hours), a dialogue corpus built around a single character and captured via optical motion capture, with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a Motion Foundation Model on 200K+ motion sequences, equipping it with rich action priors that go well beyond conversation. We then propose an audio-aware plan-then-infill architecture that decouples…
