LEO: Generative Latent Image Animator for Human Video Synthesis
Yaohui Wang, Xin Ma, Xinyuan Chen, Cunjian Chen, Antitza Dantcheva, Bo Dai, Yu Qiao

TL;DR
LEO introduces a flow-based framework for human video synthesis that emphasizes spatio-temporal coherency by disentangling motion from appearance, enabling high-quality, coherent, and editable human videos.
Contribution
The paper proposes LEO, a novel flow-based model that effectively separates motion from appearance, improving human video synthesis and enabling infinite-length generation and content-preserving editing.
Findings
Significantly improves spatio-temporal coherence in human videos.
Enables infinite-length human video synthesis.
Supports content-preserving video editing.
Abstract
Spatio-temporal coherency is a major challenge in synthesizing high quality videos, particularly in synthesizing human videos that contain rich global and local deformations. To resolve this challenge, previous approaches have resorted to different features in the generation process aimed at representing appearance and motion. However, in the absence of strict mechanisms to guarantee such disentanglement, a separation of motion from appearance has remained challenging, resulting in spatial distortions and temporal jittering that break the spatio-temporal coherency. Motivated by this, we here propose LEO, a novel framework for human video synthesis, placing emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance. We implement this idea via a flow-based image animator…
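To make the flow-map idea concrete, here is a minimal sketch of warping one fixed frame with a sequence of flow maps. It is a toy illustration, not the paper's animator: the integer `(dy, dx)` offsets, nearest-pixel sampling, border clamping, and the helper names `warp`, `animate`, and `constant_flow` are all assumptions made for this example. It also assumes every flow map is applied to the same starting frame.

```python
from typing import List, Tuple

Image = List[List[int]]                 # H x W grid of pixel values
Flow = List[List[Tuple[int, int]]]      # H x W grid of (dy, dx) offsets (assumed integer)


def warp(image: Image, flow: Flow) -> Image:
    """Backward-warp: out[y][x] = image[y + dy][x + dx], clamped to the border."""
    h, w = len(image), len(image[0])
    out: Image = []
    for y in range(h):
        row = []
        for x in range(w):
            dy, dx = flow[y][x]
            sy = min(max(y + dy, 0), h - 1)   # clamp source row
            sx = min(max(x + dx, 0), w - 1)   # clamp source column
            row.append(image[sy][sx])
        out.append(row)
    return out


def animate(start: Image, flows: List[Flow]) -> List[Image]:
    """Produce one frame per flow map, always warping the same starting frame."""
    return [warp(start, f) for f in flows]


def constant_flow(h: int, w: int, dy: int, dx: int) -> Flow:
    """Hypothetical helper: the same offset at every pixel."""
    return [[(dy, dx)] * w for _ in range(h)]


start = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
flows = [constant_flow(3, 3, 0, 0), constant_flow(3, 3, 0, 1)]
video = animate(start, flows)
```

Every output pixel is copied from the starting frame, so the flow maps decide only where content moves and never what it looks like. That is the sense in which a flow-based representation keeps motion apart from appearance.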
Peer Reviews
Decision: Submitted to ICLR 2024
- The application of synthesizing videos of arbitrary length is relevant and challenging.
- The main idea is simple and clearly presented.
- The quantitative and qualitative results showcase the efficacy of the proposed model over the baselines on the TaichiHD, FaceForensics and CelebV-HQ datasets.
- It would be nice to see some human-specific baselines, especially since the focus of the paper is on humans, e.g., utilizing skeleton/3DMM guidance.
- I believe a comparison (or at least discussion) to Video-LDM [1] would be beneficial.
- I am missing a section on the limitations and ethical considerations.
[1] Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
The paper is well-written and easy to follow, and the proposed solution sounds solid. Particularly:
- Novel formulation of a diffusion-based generative model for optical flow generation, which enables long-term motion generation.
- Explicit disentanglement of the video into appearance (pixel values) and motion (optical flow), which helps LEO better preserve the identity information in the input.
- Auto-regressive motion generation with careful designs that achieve long-term video generation.
The qu
While showing promising results, LEO has some limitations, which are also observed in other baselines:
- Geometry ambiguity: without any explicit notion of 3D geometry or semantic features, LEO often flips or morphs the limbs from one side to the other. This is particularly obvious in the TaichiHD videos.
- Temporal coherency: while LEO improves greatly over the other baselines compared in the paper, the appearance can still drift off/morph arbitrarily between frames, especially for videos with
1. This work tries to solve the challenging issue of disentangling motion from appearance. The method is well-motivated and the proposed method is simple to understand.
2. A Linear Motion Condition (LMC) mechanism is designed in cLMDM to condition the generative process with the first motion code α1.
3. Qualitative results show the ability to generate long videos and enable disentanglement of motion and appearance.
1. The authors compare against only a hand-picked set of methods; SOTA methods are not included. Recent methods, such as MoStGAN-V, VDM, Video-LDM, VideoFactory, and Make-A-Video, should be included for comparison.
2. The authors should include experiments on more challenging datasets, such as MSR-VTT and UCF101.
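The reviews above mention auto-regressive motion generation and conditioning on a first motion code α1. The sketch below shows one way such chunk-by-chunk generation could extend a sequence to any length. It is an illustration under stated assumptions, not the paper's cLMDM: `toy_sampler` is a hypothetical random-walk stand-in for a learned conditional generator, and seeding each chunk with the last code of the previous chunk is assumed for this example.

```python
import random
from typing import Callable, List

MotionCode = List[float]
Sampler = Callable[[MotionCode, int, random.Random], List[MotionCode]]


def toy_sampler(first_code: MotionCode, length: int, rng: random.Random) -> List[MotionCode]:
    """Hypothetical stand-in for a conditional motion generator.

    Returns `length` codes; the first one equals the conditioning code and
    the rest follow a small Gaussian random walk.
    """
    codes = [list(first_code)]
    for _ in range(length - 1):
        codes.append([v + rng.gauss(0.0, 0.1) for v in codes[-1]])
    return codes


def generate_long(first_code: MotionCode, chunk_len: int, num_chunks: int,
                  sampler: Sampler, rng: random.Random) -> List[MotionCode]:
    """Chain chunks autoregressively: each chunk is conditioned on the last code so far."""
    sequence = [list(first_code)]
    for _ in range(num_chunks):
        chunk = sampler(sequence[-1], chunk_len, rng)
        sequence.extend(chunk[1:])   # chunk[0] repeats the conditioning code
    return sequence


rng = random.Random(0)
codes = generate_long([0.0, 0.0], chunk_len=4, num_chunks=3,
                      sampler=toy_sampler, rng=rng)
```

Because each chunk depends only on its conditioning code, `num_chunks` can grow without bound. Each chunk adds `chunk_len - 1` new codes, so the call above yields 10 codes in total.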
Taxonomy
Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis
Methods: Diffusion · Temporal Jittering
