Making Pose Representations More Expressive and Disentangled via Residual Vector Quantization
Sukhyun Jeong, Hong-Gi Shin, Yong-Hoon Choi

TL;DR
This paper introduces a novel residual vector quantization approach to enhance pose code representations, making them more expressive and disentangled for improved controllable 3D human motion generation.
Contribution
We propose augmenting pose code-based latent representations with continuous motion features using residual vector quantization (RVQ), balancing the interpretability of pose codes with the ability to capture subtle motion details.
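To make the residual idea concrete, the following is a minimal sketch of generic residual vector quantization in NumPy, not the authors' implementation. It assumes a stack of codebooks in which each stage quantizes whatever the previous stages left unexplained. The names `nearest_code` and `rvq_encode`, the codebook sizes, and the random data are all hypothetical choices for illustration. How the paper ties the first stage to pose codes is not specified in this summary, so the sketch treats every stage identically.

```python
import numpy as np

def nearest_code(x, codebook):
    """Return the index of the closest codebook entry for each row of x.

    x: (N, D) array of vectors, codebook: (K, D) array of code vectors.
    """
    # Squared Euclidean distance between every vector and every code: (N, K)
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def rvq_encode(x, codebooks):
    """Quantize x stage by stage, each stage encoding the remaining residual.

    Returns the per-stage indices (N, num_stages) and the reconstruction (N, D).
    """
    residual = x.copy()
    recon = np.zeros_like(x)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        quantized = codebook[idx]
        recon += quantized      # accumulate the approximation so far
        residual -= quantized   # what is still unexplained
        indices.append(idx)
    return np.stack(indices, axis=1), recon

# Hypothetical toy setup: 16 latent vectors of dimension 8, 4 stages of 32 codes.
# Codebooks are random (untrained) and shrink in scale at later stages.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
codebooks = [rng.normal(size=(32, 8)) * (0.5 ** i) for i in range(4)]
indices, recon = rvq_encode(x, codebooks)
```

Each row of `indices` holds one discrete code per stage, and `recon` is the sum of the selected code vectors. With trained codebooks, later stages would be expected to pick up progressively finer detail; with the random codebooks used here, no such improvement is guaranteed.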
Findings
Reduced FID from 0.041 to 0.015 on HumanML3D
Improved Top-1 R-Precision from 0.508 to 0.510
Enhanced controllability for motion editing
Abstract
Recent progress in text-to-motion has advanced both 3D human motion generation and text-based motion control. Controllable motion generation (CoMo), which enables intuitive control, typically relies on pose code representations, but discrete pose codes alone cannot capture fine-grained motion details, limiting expressiveness. To overcome this, we propose a method that augments pose code-based latent representations with continuous motion features using residual vector quantization (RVQ). This design preserves the interpretability and manipulability of pose codes while effectively capturing subtle motion characteristics such as high-frequency details. Experiments on the HumanML3D dataset show that our model reduces Fréchet inception distance (FID) from 0.041 to 0.015 and improves Top-1 R-Precision from 0.508 to 0.510. Qualitative analysis of pairwise direction similarity between pose…
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Robot Manipulation and Learning
