Taming Camera-Controlled Video Generation with Verifiable Geometry Reward

Zhaoqing Wang; Xiaobo Xia; Zhuolin Bie; Jinlin Liu; Dongdong Yu; Jia-Wang Bian; Changhu Wang

arXiv:2512.02870·cs.CV·December 3, 2025

Open Access

TL;DR

This paper introduces an online reinforcement learning post-training framework with a verifiable geometry reward for camera-controlled video generation, reporting better camera-control accuracy, geometric consistency, and visual quality than supervised fine-tuning baselines.

Contribution

The paper presents an RL post-training approach with a geometry-based reward for improved camera control in video generation, along with a new dataset covering diverse camera motions.

Findings

01 Outperforms supervised fine-tuning in camera-control accuracy

02 Improves geometric consistency in generated videos

03 Enhances visual quality of camera-controlled videos

Abstract

Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization…
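The segment-level reward described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not specify the pose estimator, the segment length, or the form of the alignment score, so everything concrete here is an assumption: trajectories are taken as already-estimated 4x4 camera-to-world matrices, segments are non-overlapping, the error combines a rotation geodesic angle with a translation-direction angle (direction only, since monocular pose estimates are typically scale-ambiguous), and the score is `exp(-error)`. The names `segment_rewards`, `seg_len`, `w_rot`, and `w_trans` are hypothetical.

```python
import numpy as np


def relative_pose(T_a, T_b):
    """Pose of frame b expressed in frame a's coordinates (4x4 camera-to-world inputs)."""
    return np.linalg.inv(T_a) @ T_b


def segment_relative_poses(traj, seg_len):
    """Split an (N, 4, 4) trajectory into consecutive segments; one relative pose per segment."""
    last = len(traj) - 1
    return [relative_pose(traj[s], traj[min(s + seg_len, last)])
            for s in range(0, last, seg_len)]


def rotation_angle(R):
    """Geodesic angle (radians) of a 3x3 rotation matrix."""
    cos = (np.trace(R) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def translation_angle(t_gen, t_ref, eps=1e-8):
    """Angle (radians) between two translation directions; scale is ignored."""
    n_gen, n_ref = np.linalg.norm(t_gen), np.linalg.norm(t_ref)
    if n_gen < eps and n_ref < eps:
        return 0.0  # both segments are static: treat as aligned
    if n_gen < eps or n_ref < eps:
        return float(np.pi / 2)  # assumed penalty when only one camera moves
    cos = float(t_gen @ t_ref) / (n_gen * n_ref)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def segment_rewards(gen_traj, ref_traj, seg_len=4, w_rot=1.0, w_trans=1.0):
    """Return one alignment score in (0, 1] per generated/reference segment pair.

    The exp(-error) mapping and the weights are illustrative assumptions.
    """
    gen_traj, ref_traj = np.asarray(gen_traj), np.asarray(ref_traj)
    if gen_traj.shape != ref_traj.shape:
        raise ValueError("trajectories must have the same shape")
    rewards = []
    for P_gen, P_ref in zip(segment_relative_poses(gen_traj, seg_len),
                            segment_relative_poses(ref_traj, seg_len)):
        rot_err = rotation_angle(P_gen[:3, :3].T @ P_ref[:3, :3])
        trans_err = translation_angle(P_gen[:3, 3], P_ref[:3, 3])
        rewards.append(np.exp(-(w_rot * rot_err + w_trans * trans_err)))
    return np.array(rewards)
```

Returning one score per segment, rather than a single score per video, is what makes the feedback dense: a video that follows the reference camera path for most of its length but drifts in one segment is penalized only for that segment.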

Taxonomy

Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition