Body of Her: A Preliminary Study on End-to-End Humanoid Agent
Tenglong Ao

TL;DR
This paper introduces a real-time, multimodal humanoid agent system that models speech, full-body movement, and manipulation, aiming to close the gap between existing systems and a realistic, interactive virtual humanoid agent.
Contribution
It presents an end-to-end neural network, extended from a pre-trained large language model, that integrates audio and visual inputs to produce realistic, duplex humanoid agent behaviors.
Findings
Demonstrates capabilities like generalized object manipulation.
Achieves real-time duplex communication.
Models full-body movements and facial expressions.
Abstract
An interactive virtual humanoid agent is a crucial interface with the physical world. A relatively complete humanoid agent first needs a face and body; it then needs both verbal and non-verbal abilities (such as eye contact, facial expression, lip motion, gesture, and manipulation); and finally it must be capable of real-time duplex communication, e.g., actively interrupting a conversation. Most prior systems consider only a subset of these elements, leaving a gap between them and a realistic humanoid agent. In this work, we propose a real-time, duplex, interactive end-to-end network capable of modeling realistic agent behaviors, including speech and full-body movements for talking, responding, idling, and manipulation. The system is a multimodal model that integrates audio and visual inputs and is extended from a pre-trained large language model (LLM). We collect approximately 200,000 hours…
Taxonomy
Topics: Ethics and Social Impacts of AI · Neuroethics, Human Enhancement, Biomedical Innovations · Utopian, Dystopian, and Speculative Fiction
