Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation
Yang Fei, George Stoica, Jingyuan Liu, Qifeng Chen, Ranjay Krishna, Xiaojuan Wang, Benlin Liu

TL;DR
This paper distills structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX), improving the realism of generated motion for articulated and deformable objects such as humans and animals.
Contribution
It proposes a new distillation approach with a bidirectional feature fusion module and a Local Gram Flow loss, enhancing structure preservation in video generation.
Findings
Achieves 95.51% on VBench, surpassing previous methods.
Reduces Fréchet Video Distance (FVD) by over 21%, indicating more realistic generated video.
Attains a 71.4% preference rate in human preference tests.
Abstract
Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional…
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis
