Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

Zichen Geng; Caren Han; Zeeshan Hayder; Jian Liu; Mubarak Shah; Ajmal Mian

arXiv:2405.15439·cs.CV·May 27, 2024

TL;DR

This paper introduces KeyMotion, a text-guided 3D human motion generation method that first generates keyframes in a VAE latent space, using a diffusion process built on a Parallel Skip Transformer, and then in-fills the remaining frames. It reports state-of-the-art results on the HumanML3D dataset.

Contribution

The paper presents a new framework combining keyframe generation, VAE-based latent space projection, and a Parallel Skip Transformer for improved text-guided human motion synthesis.
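The keyframe-then-in-fill idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `pick_keyframes` and `infill_linear`, the uniform keyframe spacing, and the flat-list pose layout are hypothetical, and plain linear interpolation stands in for the paper's learned, text-guided in-filling, which this page does not detail.

```python
from typing import List, Sequence

# Hypothetical pose layout: one frame is a flat list of joint coordinates.
Pose = List[float]


def pick_keyframes(num_frames: int, stride: int) -> List[int]:
    """Return evenly spaced keyframe indices, always including the last frame.

    Uniform spacing is an assumption; the paper may select keyframes differently.
    """
    if num_frames <= 0 or stride <= 0:
        raise ValueError("num_frames and stride must be positive")
    indices = list(range(0, num_frames, stride))
    if indices[-1] != num_frames - 1:
        indices.append(num_frames - 1)
    return indices


def infill_linear(keyframes: Sequence[Pose], indices: Sequence[int]) -> List[Pose]:
    """Rebuild a full sequence by linearly interpolating between keyframes.

    This is a naive baseline standing in for a learned in-filling model.
    """
    if not keyframes or len(keyframes) != len(indices):
        raise ValueError("need one index per keyframe and at least one keyframe")
    motion: List[Pose] = []
    segments = zip(zip(indices, keyframes), zip(indices[1:], keyframes[1:]))
    for (i0, p0), (i1, p1) in segments:
        span = i1 - i0
        for step in range(span):
            t = step / span  # 0.0 at the left keyframe, approaching 1.0 at the right
            motion.append([(1 - t) * a + t * b for a, b in zip(p0, p1)])
    motion.append(list(keyframes[-1]))  # the final keyframe closes the sequence
    return motion
```

For example, `pick_keyframes(10, 4)` selects frames 0, 4, 8, and 9, and `infill_linear` returns ten frames from the four poses at those indices. A generative model only has to produce the keyframes, which is where the efficiency claim comes from.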

Findings

01. Achieves state-of-the-art results on the HumanML3D dataset.

02. Outperforms prior methods on the R-precision and MultiModal Distance metrics.

03. Delivers competitive performance, with top-ranking metrics, on the KIT dataset.

Abstract

Text-driven human motion generation is an emerging task in animation and humanoid robot design. Existing algorithms directly generate the full sequence, which is computationally expensive and prone to errors because they pay no special attention to key poses, even though keyframing has been a cornerstone of animation for decades. We propose KeyMotion, which generates plausible human motion sequences corresponding to input text by first generating keyframes and then in-filling the frames between them. We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space, which reduces dimensionality and further accelerates the subsequent diffusion process. For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the keyframe latents and the text condition. To complete the motion sequence, we propose a text-guided…
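The Kullback-Leibler regularization mentioned in the abstract can be made concrete with a short sketch. It assumes the common VAE setup of a diagonal Gaussian posterior N(mu, sigma^2) and a standard normal prior, for which the KL term has the closed form 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2). The abstract does not state the posterior form or the loss weighting, so the function name `kl_to_standard_normal` and this parameterization are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form.

    Assumes a diagonal Gaussian posterior, parameterized by its mean and
    log-variance, which is the usual VAE choice (an assumption here).
    """
    if len(mu) != len(logvar):
        raise ValueError("mu and logvar must have the same length")
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))
```

The term is zero when the posterior already matches the prior (`mu = 0`, `logvar = 0`) and grows as the latent code drifts away from it, which keeps the keyframe latents in a compact region for the diffusion process to model.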

Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · Hand Gesture Recognition Systems

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections