GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts
Zoltán Á. Milacski, Koichiro Niinuma, Ryosuke Kawamura, Fernando de la Torre, László A. Jeni

TL;DR
This paper introduces GHOST, a novel open vocabulary scene encoder for grounded human motion generation that leverages knowledge distillation and regularization to improve accuracy and flexibility in multi-modal context understanding.
Contribution
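It proposes a two-step approach: the scene encoder is first pretrained via knowledge distillation from an existing open vocabulary semantic image segmentation model, followed by fine-tuning with novel regularization. Together these steps strengthen the text-scene connection and improve motion grounding.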
Findings
Achieves up to a 30% reduction in the goal object distance metric.
Demonstrates improved performance over state-of-the-art on HUMANISE dataset.
Framework is adaptable to future segmentation methods.
Abstract
The connection between our 3D surroundings and the descriptive language that characterizes them would be well-suited for localizing and generating human motion in context but for one problem. The complexity introduced by multiple modalities makes capturing this connection challenging with a fixed set of descriptors. Specifically, closed vocabulary scene encoders, which require learning text-scene associations from scratch, have been favored in the literature, often resulting in inaccurate motion grounding. In this paper, we propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene. Our two-step approach starts with pretraining the scene encoder through knowledge distillation from an existing open vocabulary semantic image segmentation model, ensuring a shared text-scene feature space. Subsequently,…
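The distillation step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the summary does not specify the loss, so a cosine-distance objective between per-point student features and teacher features is used as a stand-in, and the function names (`cosine_distillation_loss`, `ground_text_query`) and the random toy data are hypothetical rather than taken from the paper. The second function shows why a shared text-scene feature space is useful: once scene features live in the same space as text embeddings, a query can be grounded by nearest-neighbor similarity.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_distillation_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) between matching student and teacher rows.

    Assumed stand-in objective; the paper's actual loss may differ.
    """
    s = l2_normalize(student_feats)
    t = l2_normalize(teacher_feats)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def ground_text_query(point_feats, text_embedding):
    """Return the index of the scene point most similar to a text embedding,
    along with the cosine similarity of every point."""
    scores = l2_normalize(point_feats) @ l2_normalize(text_embedding)
    return int(np.argmax(scores)), scores

# Toy data: 5 scene points with 8-dimensional features (hypothetical sizes).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(5, 8))                   # teacher (image model) features
student = teacher + 0.1 * rng.normal(size=(5, 8))   # a nearly distilled student
untrained = rng.normal(size=(5, 8))                 # an unrelated student

loss_close = cosine_distillation_loss(student, teacher)
loss_random = cosine_distillation_loss(untrained, teacher)

# Pretend the text embedding of the goal object coincides with teacher point 3.
best_idx, scores = ground_text_query(student, teacher[3])
print(loss_close, loss_random, best_idx)
```

In this toy setup the well-aligned student yields a much smaller loss than the unrelated one, and the text query retrieves the matching point. That is the behavior a shared feature space is meant to provide, in contrast to a closed vocabulary encoder that has to learn text-scene associations from scratch.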
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training · Knowledge Distillation
