A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation
Edward Effendy, Kuan-Wei Tseng, Rei Kawakami

TL;DR
This paper introduces a transformer-based framework for whole-body grasping motion generation, combining pose creation, motion smoothing, and joint refinement, enhanced by a novel pretraining approach on large datasets.
Contribution
It proposes a unified transformer-based pipeline with a new pretraining strategy for improved whole-body grasping motion generation.
Findings
Outperforms state-of-the-art baselines on the GRAB dataset in coherence, stability, and visual realism
Effective transfer of learned representations to grasping tasks
Modular design adaptable to other human-motion applications
Abstract
Accepted at ICIP 2025. We present a novel transformer-based framework for whole-body grasping that addresses both pose generation and motion infilling, enabling realistic and stable object interactions. Our pipeline comprises three stages: Grasp Pose Generation for full-body grasp generation, Temporal Infilling for smooth motion continuity, and a LiftUp Transformer that refines downsampled joints back to high-resolution markers. To overcome the scarcity of hand-object interaction data, we introduce a data-efficient Generalized Pretraining stage on large, diverse motion datasets, yielding robust spatio-temporal representations transferable to grasping tasks. Experiments on the GRAB dataset show that our method outperforms state-of-the-art baselines in terms of coherence, stability, and visual realism. The modular design also supports easy adaptation to other human-motion…
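The data flow through the three stages named in the abstract can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: every function name, array size, and stage body here is a hypothetical stand-in. The grasp pose is random noise around the object, the infilling is plain linear interpolation, and the lift-up simply duplicates each joint, where the paper uses learned transformer models.

```python
import random

# All sizes are hypothetical, chosen only to keep the example small.
NUM_JOINTS = 4         # assumed low-resolution joint count
MARKERS_PER_JOINT = 2  # assumed upsampling factor for the lift-up stage
NUM_FRAMES = 5         # assumed clip length


def generate_grasp_pose(object_position, rng):
    """Stage 1 stand-in: a full-body grasp pose as NUM_JOINTS xyz points."""
    # Placeholder: joints scattered near the object (a real model is learned).
    return [[c + rng.uniform(-0.1, 0.1) for c in object_position]
            for _ in range(NUM_JOINTS)]


def infill_motion(start_pose, grasp_pose, num_frames=NUM_FRAMES):
    """Stage 2 stand-in: linear interpolation from start pose to grasp pose."""
    frames = []
    for t in range(num_frames):
        w = t / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append([[(1 - w) * s + w * g for s, g in zip(sj, gj)]
                       for sj, gj in zip(start_pose, grasp_pose)])
    return frames


def lift_up(joint_motion, markers_per_joint=MARKERS_PER_JOINT):
    """Stage 3 stand-in: expand each joint into several marker positions."""
    # Placeholder: copy each joint; a learned model would predict real markers.
    return [[list(joint) for joint in frame for _ in range(markers_per_joint)]
            for frame in joint_motion]


def run_pipeline(start_pose, object_position, seed=0):
    """Chain the stages: grasp pose -> joint motion -> marker motion."""
    rng = random.Random(seed)
    grasp_pose = generate_grasp_pose(object_position, rng)
    joint_motion = infill_motion(start_pose, grasp_pose)
    marker_motion = lift_up(joint_motion)
    return grasp_pose, joint_motion, marker_motion


if __name__ == "__main__":
    start = [[0.0, 0.0, 0.0] for _ in range(NUM_JOINTS)]
    _, joints, markers = run_pipeline(start, [0.5, 0.0, 1.0])
    print(len(joints), len(joints[0]), len(markers[0]))
```

The point of the sketch is the interface between stages: each stage consumes the previous stage's output, which is the kind of modularity the abstract suggests would let individual stages be swapped or reused.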
Taxonomy
Topics: Robot Manipulation and Learning · Motor Control and Adaptation · Human Pose and Action Recognition
Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer
