Human Motion Diffusion as a Generative Prior
Yonatan Shafir, Guy Tevet, Roy Kapon, Amit H. Bermano

TL;DR
This paper advances human motion generation by introducing three diffusion-based composition methods (sequential, parallel, and model blending) that enable long sequences, multi-person interactions, and detailed control, addressing limits in data availability and flexibility.
Contribution
It proposes novel diffusion composition techniques for long, multi-person, and controllable human motion generation, including DoubleTake, ComMDM, and DiffusionBlending, with comprehensive evaluation.
Findings
DoubleTake enables long animation generation from short clips.
ComMDM facilitates two-person motion coordination.
DiffusionBlending allows flexible motion editing and control.
Abstract
Recent work has demonstrated the significant potential of denoising diffusion models for generating human motion, including text-to-motion capabilities. However, these methods are restricted by the paucity of annotated motion data, a focus on single-person motions, and a lack of detailed control. In this paper, we introduce three forms of composition based on diffusion priors: sequential, parallel, and model composition. Using sequential composition, we tackle the challenge of long sequence generation. We introduce DoubleTake, an inference-time method with which we generate long animations consisting of sequences of prompted intervals and their transitions, using a prior trained only for short clips. Using parallel composition, we show promising steps toward two-person generation. Beginning with two fixed priors as well as a few two-person training examples, we learn a slim…
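The sequential composition idea in the abstract can be illustrated with a toy sketch. This is not the authors' DoubleTake implementation. It only shows one plausible way to stitch short intervals into a long sequence at inference time: each interval is denoised independently, and after every step the overlapping transition frames of neighboring intervals are averaged so that they agree. The names `toy_denoise_step`, `sync_handshakes`, and `compose_long_sequence`, the equal-weight averaging, and the overlap length `h` are all assumptions made for illustration.

```python
import numpy as np

def toy_denoise_step(x, t):
    """Hypothetical stand-in for a short-clip diffusion prior (assumption):
    it simply shrinks the sample toward zero at every step."""
    return x * 0.9

def sync_handshakes(intervals, h):
    """Average the last h frames of each interval with the first h frames
    of the next one, and write the shared result back into both."""
    out = [iv.copy() for iv in intervals]
    for i in range(len(out) - 1):
        shared = 0.5 * (out[i][-h:] + out[i + 1][:h])
        out[i][-h:] = shared
        out[i + 1][:h] = shared
    return out

def compose_long_sequence(num_intervals, length, h, feat_dim, steps,
                          denoise_step, seed=0):
    """Denoise several short intervals in lockstep, keeping their
    overlapping transition frames consistent, then unfold into one sequence."""
    rng = np.random.default_rng(seed)
    intervals = [rng.standard_normal((length, feat_dim))
                 for _ in range(num_intervals)]
    for t in reversed(range(steps)):
        # Each interval only ever sees a short clip, as the prior expects.
        intervals = [denoise_step(iv, t) for iv in intervals]
        intervals = sync_handshakes(intervals, h)
    # Keep every overlap region once when concatenating.
    pieces = [intervals[0]] + [iv[h:] for iv in intervals[1:]]
    return np.concatenate(pieces, axis=0)

motion = compose_long_sequence(num_intervals=3, length=40, h=10,
                               feat_dim=4, steps=5,
                               denoise_step=toy_denoise_step)
```

With three intervals of 40 frames and a 10-frame overlap, the unfolded sequence has 40 + 2 × 30 = 100 frames. In a real setting, the stand-in step would be replaced by a trained, text-conditioned short-clip prior with a separate prompt for each interval.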
Peer Reviews
Decision: ICLR 2024 poster
Overall, the paper is well-written with a clear and well-motivated introduction. The proposed method outperforms previous specialized techniques in the respective task. The experimental designs are comprehensive and show visually appealing results.
1. For the generation of long sequences, I don't think it makes sense to essentially generate each interval completely independently. A better option would be to use an autoregressive generation method similar to TEACH, but with the smarter option of combining each subsequence. Would it be possible to compare this with a scheme similar to EDGE [1]? Also, the paper admits that a comparison with DiffCollage is not possible due to a lack of publicly available code resources. Their implementation is
I want to highlight the following strengths: - The main strength I see is that the methods presented work well without the need of generating more data (or consuming large amounts of unavailable data). This is a strong benefit, since the field of human motion generation is still lagging in terms of data availability. The authors demonstrate in all 3 cases that they can satisfy the task at hand requiring small amounts of extra data/training. - I see major novelty in the methods developed for long
My only concern is with the fine-tuned motion control part. The task seems very similar to controlled motion generation. There is a body of literature on this subject, much of which uses diffusion models for controlled motion generation. The authors failed to include these methods and compare against them. Of course these methods have different data requirements, but they seem to achieve the same goal. I put a list of these methods below. I ask the authors to explain why they did not
1. The paper presents compelling quantitative and qualitative results, setting a new state-of-the-art benchmark with a significant lead. 2. This article broadens the scope of existing text-driven motion generation from three perspectives. The conclusions and experiments associated with these extensions are valuable contributions to the research community. 3. The paper is well written, ensuring that its content is readily comprehensible to its readers.
1. The authors should conduct a user study to quantitatively compare the visual results of TEACH with the proposed method for long sequence generation. Model parameters and inference speed should also be provided for a more comprehensive performance comparison between the two. 2. Dual-person motion generation lacks comparative experiments, for example, with InterGen [1]. This paper introduces the InterHuman benchmark, a large-scale dataset for dual-person motion, and provides more comparative
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
