MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training
Kengo Uchida, Takashi Shibuya, Yuhta Takida, Naoki Murata, Julian, Tanke, Shusuke Takahashi, Yuki Mitsufuji

TL;DR
MoLA is a novel framework that combines latent diffusion, adversarial training, and a variational autoencoder to enable fast, high-quality, and controllable text-to-motion generation and editing of variable-length motions.
Contribution
It introduces a unified approach for motion generation and editing using latent diffusion enhanced by adversarial training, with a new motion representation for variable-length outputs.
Findings
Adversarial training improves motion generation quality.
The framework supports multiple motion editing tasks.
Motion generation is faster and more controllable.
Abstract
In text-to-motion generation, controllability as well as generation quality and speed has become increasingly critical. The controllability challenges include generating a motion of a length that matches the given textual description and editing the generated motions according to control signals, such as the start-end positions and the pelvis trajectory. In this paper, we propose MoLA, which provides fast, high-quality, variable-length motion generation and can also deal with multiple editing tasks in a single framework. Our approach revisits the motion representation used as inputs and outputs in the model, incorporating an activation variable to enable variable-length motion generation. Additionally, we integrate a variational autoencoder and a latent diffusion model, further enhanced through adversarial training, to achieve high-quality and fast generation. Moreover, we apply a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHuman Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
