VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

Tengjiao Yin; Jinglei Shi; Heng Guo; Xi Wang

arXiv:2603.16271·cs.CV·March 24, 2026

TL;DR

This paper introduces VIGOR, a geometry-based reward that improves the geometric consistency of generated videos. It relies on pretrained geometric foundation models and a robust, physically grounded error metric, and it improves video quality without retraining.

Contribution

We propose a novel geometry-oriented reward model that evaluates multi-view consistency in generated videos using a physically grounded error metric and a geometry-aware sampling strategy.
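
As a rough illustration of the low-texture half of such a sampling strategy, the sketch below masks pixels by image-gradient magnitude and draws evaluation points only from the textured remainder. The gradient criterion, the threshold value, and the names `texture_mask` and `sample_pixels` are assumptions made for illustration; this summary does not state how regions are filtered, and the semantic filtering step is not shown.

```python
import numpy as np


def texture_mask(gray, threshold=0.05):
    """Return a boolean mask that is True where the image is textured.

    Assumption: "low texture" is approximated by small gradient magnitude.
    gray is a 2D array of intensities; threshold is a hypothetical cutoff.
    """
    gy, gx = np.gradient(gray.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold


def sample_pixels(mask, num_samples, rng):
    """Sample up to num_samples (x, y) pixel coordinates where mask is True."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return np.empty((0, 2), dtype=np.int64)
    count = min(num_samples, ys.size)
    idx = rng.choice(ys.size, size=count, replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)
```

A flat image yields an empty mask, so no points are sampled from it, while a frame with a sharp edge yields points only along that edge.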

Findings

01 Our reward model improves robustness over existing metrics.

02 It enables effective inference-time scaling of video diffusion models.

03 Experimental results show enhanced geometric consistency in generated videos.
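
One common form of inference-time scaling with a reward is best-of-N selection: generate several candidates and keep the one the reward scores highest. The sketch below shows that pattern under the assumption that it applies here; this summary does not say which scaling scheme the paper uses, and `generate`, `reward`, and `best_of_n` are hypothetical names standing in for a video sampler and a geometry reward.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Video = TypeVar("Video")


def best_of_n(
    generate: Callable[[int], Video],
    reward: Callable[[Video], float],
    seeds: Sequence[int],
) -> Tuple[Video, float]:
    """Generate one candidate per seed and return the highest-reward one.

    generate and reward are placeholders: any sampler that maps a seed to a
    video and any scorer where a larger value means better consistency.
    """
    if len(seeds) == 0:
        raise ValueError("seeds must not be empty")
    best_video = None
    best_score = float("-inf")
    for seed in seeds:
        video = generate(seed)
        score = reward(video)
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score
```

The generator itself is untouched in this pattern; only the number of candidates grows, which is why no retraining is involved.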

Abstract

Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach conducts error computation in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model…
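
To make the idea of a pointwise cross-frame reprojection error concrete, the sketch below lifts sampled pixels from frame i into 3D, moves them into frame j, projects them, and measures the distance to their matched locations in pixel coordinates, with no use of pixel intensities. It assumes a pinhole camera with shared intrinsics `K`, per-point depth, a relative pose `T_ij`, and known correspondences, all of which would come from a pretrained geometry model in practice. The function name and this exact formulation are illustrative assumptions, not the paper's definition.

```python
import numpy as np


def reprojection_error(pix_i, depth_i, pix_j, K, T_ij):
    """Per-point reprojection error, in pixels, from frame i into frame j.

    pix_i:   (N, 2) pixel coordinates (x, y) sampled in frame i
    depth_i: (N,)   depth of each sampled pixel in camera i
    pix_j:   (N, 2) matched pixel coordinates in frame j
    K:       (3, 3) pinhole intrinsics, assumed shared by both frames
    T_ij:    (4, 4) rigid transform from camera i to camera j
    Assumes every transformed point lies in front of camera j.
    """
    ones = np.ones((pix_i.shape[0], 1))
    homogeneous = np.hstack([pix_i, ones])           # (N, 3)
    rays = homogeneous @ np.linalg.inv(K).T          # back-project to rays
    points_i = rays * depth_i[:, None]               # 3D points in camera i
    points_j = points_i @ T_ij[:3, :3].T + T_ij[:3, 3]
    projected = points_j @ K.T
    uv = projected[:, :2] / projected[:, 2:3]        # perspective divide
    return np.linalg.norm(uv - pix_j, axis=1)
```

For example, with a focal length of 100 pixels, a point at depth 2, and a sideways camera shift of 0.1, the projected point moves by 100 × 0.1 / 2 = 5 pixels, so an unmoved match gives an error of 5.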

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Advanced Vision and Imaging