Human MotionFormer: Transferring Human Motions with Vision Transformers
Hongyu Liu, Xintong Han, Chengbin Jin, Lihui Qian, Huawei Wei, Zhe Lin, Faqiang Wang, Haoye Dong, Yibing Song, Jia Xu, and Qifeng Chen

TL;DR
Human MotionFormer introduces a hierarchical Vision Transformer framework that effectively captures both large and subtle human motion details for high-quality motion transfer, setting new state-of-the-art results.
Contribution
The paper proposes a novel hierarchical ViT architecture with global and local perception modules and a mutual learning loss for improved human motion transfer.
Findings
Achieves state-of-the-art performance in motion transfer quality.
Effectively captures both large and subtle motions.
Demonstrates superior qualitative and quantitative results.
Abstract
Human motion transfer aims to transfer motions from a target dynamic person to a source static one for motion synthesis. An accurate matching between the source person and the target motion in both large and subtle motion changes is vital for improving the transferred motion quality. In this paper, we propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perceptions to capture large and subtle motion matching, respectively. It consists of two ViT encoders to extract input features (i.e., a target motion image and a source human image) and a ViT decoder with several cascaded blocks for feature matching and motion transfer. In each block, we set the target motion feature as Query and the source person as Key and Value, calculating the cross-attention maps to conduct a global feature matching. Further, we introduce a convolutional layer to improve the…
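The global matching step described in the abstract (target motion feature as Query, source person feature as Key and Value) can be illustrated with a minimal single-head cross-attention sketch. This is not the authors' implementation: the function name `cross_attention`, the token counts, the feature dimension, and the randomly initialised projection matrices are all assumptions chosen for illustration, and the hierarchical encoders, cascaded decoder blocks, and convolutional branch are omitted.

```python
import numpy as np

def cross_attention(target_motion, source_person, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative sketch, hypothetical names).

    target_motion: (n_target, dim) tokens from the target motion image
    source_person: (n_source, dim) tokens from the source human image
    """
    q = target_motion @ w_q          # Query comes from the target motion feature
    k = source_person @ w_k          # Key comes from the source person feature
    v = source_person @ w_v          # Value comes from the source person feature

    # Scaled dot-product scores between every target token and every source token
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Numerically stable softmax over the source tokens
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)

    # Each target token gathers a weighted mix of source features
    return attn @ v, attn

# Toy inputs: sizes are arbitrary and chosen only for the example
rng = np.random.default_rng(0)
n_target, n_source, dim = 6, 8, 16
target_motion = rng.standard_normal((n_target, dim))
source_person = rng.standard_normal((n_source, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

matched, attn = cross_attention(target_motion, source_person, w_q, w_k, w_v)
print(matched.shape, attn.shape)
```

Each row of `attn` is a distribution over source tokens, so every target-motion position can draw appearance information from anywhere in the source image, which is what makes this matching global rather than local.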
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Advanced Vision and Imaging
