VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

Hongyang Du; Junjie Ye; Xiaoyan Cong; Runhao Li; Jingcheng Ni; Aman Agarwal; Zeqi Zhou; Zekun Li; Randall Balestriero; Yue Wang

arXiv:2601.23286 · cs.CV · May 13, 2026

TL;DR

VideoGPA introduces a self-supervised framework that improves 3D structural consistency in video diffusion models by leveraging geometry priors and preference optimization, leading to more stable and plausible video generation.

Contribution

VideoGPA presents a data-efficient method that uses geometry-based preference signals, rather than human annotations, to guide video diffusion models toward 3D consistency.

Findings

01

Enhances temporal stability and geometric plausibility in generated videos.

02

Outperforms state-of-the-art baselines in experiments.

03

Needs only a small number of preference pairs for effective guidance.

Abstract

While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, geometric plausibility, and motion coherence using minimal preference pairs, consistently outperforming…
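
The preference-alignment step described in the abstract can be illustrated with the standard DPO loss for a single preference pair. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the scalar `score_a`/`score_b` geometry-consistency scores, and the default `beta` are hypothetical, and the log-likelihoods are plain floats. Diffusion models do not expose exact log-likelihoods, so a practical variant would substitute a tractable surrogate; the excerpt above does not say which one VideoGPA uses.

```python
import math


def make_pair(video_a, video_b, score_a, score_b):
    """Order two samples as (preferred, rejected).

    Assumption: score_* is a hypothetical scalar geometry-consistency
    score (higher is better), e.g. derived from a geometry foundation model.
    """
    return (video_a, video_b) if score_a >= score_b else (video_b, video_a)


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*     : log-likelihood under the model being fine-tuned
    ref_logp_* : log-likelihood under the frozen reference model
    beta       : strength of the implicit KL regularization (assumed value)
    """
    # How much more the policy favors the winner over the loser,
    # relative to the reference model.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2. The loss falls as the policy assigns relatively more likelihood to the geometrically preferred sample.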
