Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches

Qing Yu; Mikihiro Tanaka; Kent Fujiwara

arXiv:2405.04771·cs.CV·May 9, 2024

Open Access

TL;DR

This paper introduces motion patches and leverages Vision Transformers with transfer learning to improve 3D human motion-language models, achieving state-of-the-art results despite limited motion data.

Contribution

The paper proposes a novel motion patch representation and demonstrates the effectiveness of ViT transfer learning for motion analysis tasks.

Findings

01 Motion patches are robust to variations in skeleton structure.

02 Transfer learning from ViT weights pre-trained on 2D images improves motion analysis performance.

03 The approach achieves state-of-the-art results on text-to-motion retrieval and other tasks.
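Text-to-motion retrieval in a shared latent space can be pictured as a nearest-neighbor search: embed the text query, embed each candidate motion, and rank the motions by similarity. The sketch below is illustrative only. It assumes the embeddings already exist as NumPy vectors and uses cosine similarity as the ranking score, which this page does not confirm as the paper's choice; the function names `l2_normalize` and `rank_motions` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps avoids division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def rank_motions(text_emb, motion_embs):
    """Rank candidate motions for one text query.

    text_emb:    (D,) embedding of the text query (assumed precomputed).
    motion_embs: (N, D) embeddings of N candidate motions (assumed precomputed).
    Returns (order, scores): motion indices from best to worst match and the
    matching cosine similarities.
    """
    sims = l2_normalize(motion_embs) @ l2_normalize(text_emb)
    order = np.argsort(-sims, kind="stable")  # descending similarity
    return order, sims[order]


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real encoder outputs.
    text = np.array([1.0, 0.0])
    motions = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
    order, scores = rank_motions(text, motions)
    print(order, scores)
```

Retrieval benchmarks typically summarize such rankings with recall at a fixed cutoff, that is, how often the ground-truth motion appears among the top-ranked candidates.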

Abstract

To build a cross-modal latent space between 3D human motion and language, acquiring large-scale and high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To counter this, we introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures, and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion…
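To make the idea of motion patches concrete, here is a minimal sketch of one way a motion sequence could be turned into image-like patches. It follows only what the abstract states: joints are grouped and ordered by body part, and the result is treated like color image patches. Everything else is an assumption made for illustration: the 22-joint layout and its index groups in `BODY_PARTS`, the linear interpolation of each joint chain to a fixed length, the per-axis min-max normalization, the patch size of 16, and the function names.

```python
import numpy as np

# Hypothetical 22-joint skeleton; these index groups are illustrative only
# and are not taken from the paper.
BODY_PARTS = {
    "torso": [0, 3, 6, 9, 12, 15],
    "left_arm": [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg": [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}


def resample_joints(part, size):
    """Linearly interpolate a (T, K, 3) joint chain to (T, size, 3)."""
    frames, k, _ = part.shape
    src = np.linspace(0.0, 1.0, k)
    dst = np.linspace(0.0, 1.0, size)
    out = np.empty((frames, size, 3), dtype=np.float64)
    for t in range(frames):
        for c in range(3):
            out[t, :, c] = np.interp(dst, src, part[t, :, c])
    return out


def motion_to_patches(motion, patch=16):
    """Convert a (T, J, 3) motion array into (num_patches, patch, patch, 3).

    The x, y, z coordinates play the role of the three color channels.
    Frames that do not fill a whole patch are dropped.
    """
    used = (motion.shape[0] // patch) * patch
    if used == 0:
        raise ValueError("sequence is shorter than one patch")
    motion = motion[:used]

    # Min-max normalize each coordinate axis to [0, 1], like pixel intensities.
    lo = motion.min(axis=(0, 1), keepdims=True)
    hi = motion.max(axis=(0, 1), keepdims=True)
    norm = (motion - lo) / np.maximum(hi - lo, 1e-8)

    # One image row band per body part: (patch, T, 3) each.
    bands = [
        resample_joints(norm[:, joints, :], patch).transpose(1, 0, 2)
        for joints in BODY_PARTS.values()
    ]
    image = np.concatenate(bands, axis=0)  # (parts * patch, T, 3)

    # Cut the image into non-overlapping patch x patch tiles.
    parts, steps = len(BODY_PARTS), used // patch
    tiles = image.reshape(parts, patch, steps, patch, 3).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(parts * steps, patch, patch, 3)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.normal(size=(40, 22, 3))  # 40 frames of synthetic joint positions
    print(motion_to_patches(demo).shape)
```

Because every body part is resampled to the same fixed length, the tile shape does not depend on how many joints a skeleton has. That is one plausible reading of why such a representation would tolerate varying skeleton structures, and the resulting tiles have the same shape as the patches a ViT consumes.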

Taxonomy

Topics: Hand Gesture Recognition Systems