Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

Yang Fei; George Stoica; Jingyuan Liu; Qifeng Chen; Ranjay Krishna; Xiaojuan Wang; Benlin Liu

arXiv:2512.11792·cs.CV·December 15, 2025

Open Access

TL;DR

This paper introduces a method to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX), improving the realism of generated motion for articulated and deformable objects.

Contribution

The paper proposes a distillation approach built on two components, a bidirectional feature fusion module and a Local Gram Flow loss, to improve structure preservation in video generation.

Findings

01 Achieves 95.51% on VBench, surpassing previous methods.

02 Reduces FVD by over 21%, indicating more realistic video quality.

03 Obtains a 71.4% preference rate in human preference tests.

Abstract

Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional…

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis