GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts

Zoltán Á. Milacski; Koichiro Niinuma; Ryosuke Kawamura; Fernando de la Torre; László A. Jeni

arXiv:2405.18438·cs.CV·May 30, 2024

Open Access

TL;DR

This paper introduces GHOST, a novel open vocabulary scene encoder for grounded human motion generation that leverages knowledge distillation and regularization to improve accuracy and flexibility in multi-modal context understanding.

Contribution

It proposes a two-step approach with pretraining via knowledge distillation and fine-tuning with novel regularization, enabling better text-scene connection and motion grounding.

Findings

01. Achieves up to a 30% reduction in the goal object distance metric.

02. Demonstrates improved performance over the state of the art on the HUMANISE dataset.

03. The framework is adaptable to future segmentation methods.

Abstract

The connection between our 3D surroundings and the descriptive language that characterizes them would be well-suited for localizing and generating human motion in context but for one problem. The complexity introduced by multiple modalities makes capturing this connection challenging with a fixed set of descriptors. Specifically, closed vocabulary scene encoders, which require learning text-scene associations from scratch, have been favored in the literature, often resulting in inaccurate motion grounding. In this paper, we propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene. Our two-step approach starts with pretraining the scene encoder through knowledge distillation from an existing open vocabulary semantic image segmentation model, ensuring a shared text-scene feature space. Subsequently,…
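The pretraining step described in the abstract can be illustrated with a minimal sketch. The excerpt does not state the loss used, so the cosine-based feature distillation below is an assumption, and every name in it (`cosine_distillation_loss`, `open_vocab_labels`, the array arguments) is hypothetical. It assumes that each 3D point already has a matched teacher feature from the image segmentation model, and that text prompts are embedded in the same space.

```python
import numpy as np


def _normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def cosine_distillation_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) over matched feature pairs.

    student_feats: (N, D) features from the scene encoder being pretrained.
    teacher_feats: (N, D) features from the frozen teacher (assumed matched
                   to the same N points; the matching itself is not shown).
    """
    s = _normalize(student_feats)
    t = _normalize(teacher_feats)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))


def open_vocab_labels(point_feats, text_embeds):
    """Assign each point the index of its most similar text embedding.

    This only works if point and text features share one space, which is
    what the distillation step is meant to provide.
    """
    sims = _normalize(point_feats) @ _normalize(text_embeds).T  # (N, K)
    return np.argmax(sims, axis=1)
```

Under these assumptions, driving the loss toward zero pulls the scene encoder's per-point features toward the teacher's, so any text prompt the teacher can embed may be compared against scene points without retraining on a fixed label set.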

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · Knowledge Distillation