VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, Marta Tintore Gazulla

TL;DR
VGGRPO introduces a latent geometry-guided framework that enhances geometric consistency in video generation, especially for dynamic scenes, without compromising pretrained model capabilities or incurring high computational costs.
Contribution
The paper proposes VGGRPO, a latent-space, geometry-guided post-training approach that uses a 4D reconstruction model and reinforcement learning rewards to improve world consistency in video generation.
Findings
Improves camera stability and geometric coherence in static and dynamic videos.
Eliminates the need for repeated VAE decoding, reducing computational overhead.
Enhances overall video quality and consistency across benchmarks.
Abstract
Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation…
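The "GRPO" in the name suggests a group-relative policy optimization scheme, in which several samples drawn for the same prompt are scored and each reward is normalized against its group. The sketch below illustrates only that generic normalization step. It is not the authors' implementation: the function name, the `eps` constant, and the reward values are hypothetical, and the abstract as shown here does not specify how VGGRPO computes its rewards or its training objective.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples from the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    scoring above the group average receive positive advantages.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical geometry-consistency scores for four videos from one prompt.
# In a latent-reward setup these would be computed from diffusion latents
# rather than from decoded RGB frames (assumption for illustration).
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within each group, they sum to approximately zero, and a group in which every sample receives the same reward contributes no learning signal.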
