CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback
Wenhang Ge, Guibao Shen, Jiawei Feng, Luozhou Wang, Hao Lu, Xingye Tian, Xin Tao, Ying-Cong Chen

TL;DR
CamPilot enhances camera control in video diffusion models by introducing a 3D decoder and reward feedback mechanism that improves alignment accuracy and efficiency in video generation.
Contribution
We propose a novel camera-aware 3D decoder and reward feedback approach that significantly improves camera controllability in video diffusion models.
Findings
Effective camera control demonstrated on RealEstate10K and WorldScore benchmarks.
Improved video-camera alignment accuracy.
Reduced computational overhead in reward computation.
Abstract
Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, camera controllability remains limited. In this work, we build upon Reward Feedback Learning (ReFL) to further improve camera controllability. Directly borrowing existing ReFL approaches, however, faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latents into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latents into 3D representations for reward quantization. Specifically, the video latent and the camera pose are decoded into 3D Gaussians. In this process, the camera pose not…
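The abstract is cut off before it states how the reward is computed, but one review below describes the reward signal as pixel-level. As a rough illustration of that idea only, the sketch below scores a video by the negative mean squared error between its frames and views rendered along the conditioning camera trajectory. This is not the authors' implementation: `pixel_alignment_reward` and `toy_render` are hypothetical names, and `toy_render` replaces a real 3D Gaussian renderer with a horizontal shift of a 2D image purely so the example runs.

```python
from typing import Sequence

import numpy as np


def pixel_alignment_reward(rendered: np.ndarray, frames: np.ndarray) -> float:
    """Negative MSE between two frame stacks of shape (T, H, W, C).

    Higher is better; 0 means the stacks match exactly.
    Assumption: a pixel-level reward, not the paper's exact formulation.
    """
    if rendered.shape != frames.shape:
        raise ValueError("frame stacks must have the same shape")
    return -float(np.mean((rendered - frames) ** 2))


def toy_render(scene: np.ndarray, shifts: Sequence[int]) -> np.ndarray:
    """Hypothetical stand-in for a 3DGS renderer.

    A sideways camera pan is mimicked by rolling a 2D image horizontally,
    one frame per entry in `shifts`.
    """
    return np.stack([np.roll(scene, -s, axis=1) for s in shifts])


rng = np.random.default_rng(0)
scene = rng.random((16, 32, 3))

target_shifts = [0, 1, 2, 3]                       # conditioning camera trajectory
reference = toy_render(scene, target_shifts)       # views at the prescribed poses
aligned = toy_render(scene, target_shifts)         # video that follows the camera
misaligned = toy_render(scene, [0, 0, 1, 1])       # video that lags behind it

r_good = pixel_alignment_reward(reference, aligned)
r_bad = pixel_alignment_reward(reference, misaligned)
print(r_good, r_bad)
```

In this toy setup the video that follows the prescribed trajectory receives the higher reward, which is the property a camera-alignment reward needs for feedback learning to push the generator toward the conditioning poses.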
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Introduces a feed-forward 3D Gaussian-based decoder that efficiently evaluates camera-video alignment without reliance on computationally intensive post-processing tools like COLMAP. 2. Applies reward feedback learning (ReFL) to optimize camera adherence, which represents a previously underexplored direction in video diffusion. 3. Enables high-quality 3D scene reconstruction directly from video latents and camera poses, bypassing computationally expensive per-scene optimization.
1. The evaluation is limited to static-scene datasets, which potentially limits the method's applicability to real-world video generation tasks. 2. The method assumes that precise extrinsic and intrinsic camera parameters are available, which may not hold in real-world applications.
1. Originality: This paper addresses the under-explored problem of enforcing camera conditioning in video diffusion models using ReFL, thereby improving alignment between generated footage and prescribed camera parameters. 2. Quality: The proposed approach incorporates a camera-aware 3D decoder that efficiently evaluates video–camera consistency while reducing computational overhead. Experimental results demonstrate clear gains in both camera control accuracy and overall visual quality. 3. Clari
1. The paper provides limited insight into how the method scales to larger datasets or real-world deployments; a more detailed analysis of computational requirements and potential bottlenecks would strengthen its practical relevance. 2. By relying on 3DGS, the approach is inherently restricted to static scenes, as the authors acknowledge, limiting its applicability to dynamic or non-rigid environments. 3. The pixel-level reward signal may be too low-level to capture high-order semantics, potenti
1. Using 3DGS to enhance camera-guided video generation is a good starting point.
1. The writing quality of this paper needs improvement. Many passages are verbose. For example, Sections 2.2 and 2.3 repeat earlier content several times, adding unnecessary length. Moreover, the core comparison experiment, which shows how much computational cost is saved relative to VAE decoding, appears only in Appendix A.4. From the main text alone, the experimental details are unclear. 2. The baseline methods used for comparison are somewhat outdated
- Novelty of using reward feedback for camera control: I have not seen a work that uses 3DGS to improve the underlying video model and its 3D consistency/camera control precision. The direction seems promising.
- Camera-aware 3DGS decoder already proposed: The first item in the contribution list claims the camera-aware 3DGS decoder as a contribution. However, the approach follows Wonderland [1] without changes. Claiming the whole pipeline as a contribution seems a significant overclaim, and the authors should have acknowledged Wonderland far more prominently throughout the paper. Wonderland is mentioned and referenced in the paper, but the two approaches do not differ. The main point of the paper is the reward-bas
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Video Coding and Compression Technologies
