TL;DR
MIBURI is an online framework that generates expressive, synchronized full-body gestures and facial expressions for Embodied Conversational Agents (ECAs) in real time, using hierarchical motion encoding and causal autoregressive modeling.
Contribution
It introduces the first real-time, causal system for expressive gesture synthesis conditioned on speech, combining hierarchical motion encoding with LLM-based context understanding.
Findings
Produces natural, contextually aligned gestures in real time.
Outperforms recent baselines in naturalness and expressiveness.
Generates expressive gestures without depending on future speech context or long run-times.
Abstract
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively…
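To make the phrase "multi-level discrete tokens" concrete, the following is a minimal sketch of one common way to obtain such tokens: residual quantization, where a coarse codebook captures the bulk of a motion feature and a finer codebook encodes what is left over. This is an illustrative assumption, not MIBURI's actual codec. The abstract does not specify the quantization scheme, and the codebooks, the feature vector, and the function names below are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Codebook = Sequence[Sequence[float]]


def nearest(codebook: Codebook, v: Sequence[float]) -> int:
    """Return the index of the codebook entry closest to v (squared L2)."""
    def sq_dist(code: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(v, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def encode_multilevel(v: Sequence[float], codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize v into one token per level, coarse to fine.

    Each level quantizes the residual left by the previous levels.
    """
    residual = list(v)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def decode_multilevel(tokens: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Reconstruct the feature by summing the selected code at every level."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-level codebooks for a 2-D motion feature.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],                   # level 0: coarse pose
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],  # level 1: fine detail
]

tokens = encode_multilevel([1.2, 0.8], CODEBOOKS)
reconstruction = decode_multilevel(tokens, CODEBOOKS)
```

In a body-part aware setup, one might keep a separate stack of codebooks per body part (for example face, hands, and torso) and encode each part's features independently; that split is likewise an assumption made here for illustration.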
