Allo-AVA: A Large-Scale Multimodal Conversational AI Dataset for Allocentric Avatar Gesture Animation
Saif Punjwani, Larry Heck

TL;DR
Allo-AVA is a large-scale, multimodal dataset designed to improve avatar gesture animation by providing synchronized speech, facial, and body movement data for virtual environment applications.
Contribution
The paper introduces Allo-AVA, a dataset of 1,250 hours of annotated video content for text- and audio-driven avatar gesture animation from a third-person view.
Findings
Enables development of more natural avatar animations.
Provides synchronized multimodal data for training AI models.
Facilitates research in virtual reality and digital assistants.
Abstract
The scarcity of high-quality, multimodal training data severely hinders the creation of lifelike avatar animations for conversational AI in virtual environments. Existing datasets often lack the intricate synchronization between speech, facial expressions, and body movements that characterizes natural human communication. To address this critical gap, we introduce Allo-AVA, a large-scale dataset specifically designed for text- and audio-driven avatar gesture animation in an allocentric (third-person point-of-view) context. Allo-AVA consists of 1,250 hours of diverse video content, complete with audio, transcripts, and extracted keypoints. Allo-AVA uniquely maps these keypoints to precise timestamps, enabling accurate replication of human movements (body and facial gestures) in synchronization with speech. This comprehensive resource enables the development and evaluation of more…
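As a rough illustration of what timestamped keypoints make possible, the sketch below pairs each transcript word with the keypoint frames that fall inside its time span. The record layout (`KeypointFrame`, `Word`), the field names, and the 4 fps sample data are all hypothetical; the actual Allo-AVA file format is not described in this summary.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class KeypointFrame:
    """Hypothetical record: one pose sample at time t (seconds)."""
    t: float
    keypoints: list  # e.g. a list of (x, y) tuples; layout is assumed


@dataclass
class Word:
    """Hypothetical record: one transcript word with start/end times (seconds)."""
    text: str
    start: float
    end: float


def frames_for_word(frames, word):
    """Return the frames whose timestamp lies in [word.start, word.end].

    Assumes `frames` is sorted by timestamp.
    """
    times = [f.t for f in frames]
    lo = bisect_left(times, word.start)
    hi = bisect_right(times, word.end)
    return frames[lo:hi]


# Toy data: 8 frames sampled every 0.25 s, and two spoken words.
frames = [KeypointFrame(t=i * 0.25, keypoints=[(0.0, float(i))]) for i in range(8)]
words = [Word("hello", 0.0, 0.5), Word("world", 0.75, 1.25)]

# Map each word to the timestamps of the frames that overlap it.
aligned = {w.text: [f.t for f in frames_for_word(frames, w)] for w in words}
print(aligned)
```

A binary search over sorted timestamps keeps each lookup logarithmic in the number of frames, which matters for long recordings; a real pipeline would likely build the `times` list once rather than per word.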
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Multimodal Machine Learning Applications
