TL;DR
This paper introduces CoAMD, a unified model that links human action recognition and motion generation through skeleton coordinates, reporting state-of-the-art results on 13 benchmarks across four tasks.
Contribution
The work presents a novel unified framework with a multi-modal recognizer and diffusion-based motion synthesis, bridging the gap between understanding and generating human motion from text and skeleton data.
Findings
Achieves state-of-the-art performance on 13 benchmarks.
Effectively handles four tasks: recognition, generation, retrieval, and editing.
Demonstrates the versatility of skeleton-based motion modeling.
Abstract
Human action recognition and motion generation are two active research problems in human-centric computer vision, both aiming to align motion with textual semantics. However, most existing works study these two problems separately, without uncovering the links between them, namely that motion generation requires semantic comprehension. This work investigates unified action recognition and motion generation by leveraging skeleton coordinates for both motion understanding and generation. We propose Coordinates-based Autoregressive Motion Diffusion (CoAMD), which synthesizes motion in a coarse-to-fine manner. As a core component of CoAMD, we design a Multi-modal Action Recognizer (MAR) that provides gradient-based semantic guidance for motion generation. Furthermore, we establish a rigorous benchmark by evaluating baselines on absolute coordinates. Our model can be applied to four…
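The abstract says a recognizer supplies gradient-based semantic guidance to the generator but does not spell out the mechanics. As a rough illustration of the general idea only, the sketch below uses generic classifier-style guidance: the mean of one denoising step is shifted along the gradient of log p(action | motion). It assumes a toy linear-softmax classifier standing in for MAR and a flattened vector of skeleton coordinates. Every name here (`class_log_prob`, `class_log_prob_grad`, `guided_mean`, `scale`) is hypothetical and not taken from the paper.

```python
import numpy as np


def class_log_prob(x, W, b, target):
    """log p(target | x) for a toy linear-softmax classifier (stand-in for a recognizer)."""
    logits = W @ x + b
    shifted = logits - logits.max()  # subtract the max for numerical stability
    return shifted[target] - np.log(np.exp(shifted).sum())


def class_log_prob_grad(x, W, b, target):
    """Analytic gradient of log p(target | x) with respect to x."""
    logits = W @ x + b
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    # d/dx log softmax(Wx + b)[target] = W[target] - sum_k probs[k] * W[k]
    return W[target] - probs @ W


def guided_mean(mean, variance, x, W, b, target, scale=1.0):
    """Shift a denoising-step mean toward motions the classifier labels as `target`.

    mean     : unguided mean predicted by the denoiser (assumed given)
    variance : scalar variance of the current denoising step (assumed given)
    x        : current noisy sample, where the classifier gradient is evaluated
    scale    : guidance strength; 0 disables guidance
    """
    return mean + scale * variance * class_log_prob_grad(x, W, b, target)
```

With `scale=0.0` the function returns the unguided mean, and a small positive `scale` nudges the sample toward the target action class. How CoAMD actually combines MAR's gradients with its coarse-to-fine autoregressive sampler is not described in the text above, so this should not be read as the paper's method.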
