Generative Photographic Control for Scene-Consistent Video Cinematic Editing
Huiqiang Sun, Liao Shen, Zhan Peng, Kun Wang, Size Wu, Yuhang Zang, Tianqi Liu, Zihao Huang, Xingyu Zeng, Zhiguo Cao, Wei Li, Chen Change Loy

TL;DR
CineCtrl is a novel video editing framework that enables precise control over photographic camera effects like bokeh and shutter speed, enhancing cinematic storytelling in generative videos.
Contribution
The paper introduces a framework with a decoupled cross-attention mechanism that controls photographic camera parameters independently of camera motion, supported by a large-scale dataset built from simulated photographic effects.
Findings
High-fidelity video generation with controlled photographic effects
Effective disentanglement of camera motion from photographic inputs
Robust model performance demonstrated through extensive experiments
Abstract
Cinematic storytelling is profoundly shaped by the artful manipulation of photographic elements such as depth of field and exposure. These effects are crucial in conveying mood and creating aesthetic appeal. However, controlling these effects in generative video models remains highly challenging, as most existing methods are restricted to camera motion control. In this paper, we propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters (e.g., bokeh, shutter speed). We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs, allowing fine-grained, independent control without compromising scene consistency. To overcome the shortage of training data, we develop a comprehensive data generation strategy that leverages simulated photographic effects with a dedicated real-world…
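To make the idea of decoupled cross-attention concrete, here is a minimal sketch in plain Python. It assumes the IP-Adapter-style formulation that the reviews below compare the method to: two cross-attention branches share the same queries, one attends to camera-motion tokens and the other to photographic-parameter tokens, and their outputs are summed. The function names, the `photo_scale` weight, and the omission of learned projection matrices are all illustrative assumptions; this is not the paper's actual implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of float vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs


def decoupled_cross_attention(queries, motion_kv, photo_kv, photo_scale=1.0):
    """Hypothetical two-branch cross-attention with shared queries.

    motion_kv and photo_kv are (keys, values) pairs for the camera-motion
    and photographic-parameter tokens. Each branch is attended separately
    and the results are added, so the photographic branch can be scaled
    or switched off without touching the motion branch.
    """
    motion_out = attention(queries, *motion_kv)
    photo_out = attention(queries, *photo_kv)
    return [
        [m + photo_scale * p for m, p in zip(m_row, p_row)]
        for m_row, p_row in zip(motion_out, photo_out)
    ]


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    motion_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
    photo_kv = ([[0.5, 0.5]], [[10.0, 20.0]])
    print(decoupled_cross_attention(queries, motion_kv, photo_kv, photo_scale=0.5))
```

Setting `photo_scale=0.0` reduces the layer to the motion branch alone, which is one simple way to check that the two conditioning signals do not share attention weights. Whether summing the branches is enough to achieve disentanglement in practice is a separate question.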
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
### Originality
1. The paper explores an under-addressed aspect of controllable video translation: photographic controls such as bokeh, exposure, color temperature, and focal length, rather than only camera trajectories. This significantly broadens the scope of controllable video translation.
2. The data side is also non-trivial: physically inspired simulation for four kinds of photographic effects, plus a real-data curation pipeline so the model doesn’t overfit to synthetic depth or simple zooms.
### T
1. **Limited Novelty and Missing Comparison.** The proposed *Camera-Decoupled Cross-Attention* is conceptually similar to existing decoupled cross-attention mechanisms such as IP-Adapter (Sec. 3.2.2), but the paper does not clearly explain the differences or cite related works, reducing the perceived originality.
2. **Data Pipeline Reliability.** The data synthesis pipeline depends heavily on depth estimation (“Video Depth Anything”) and bokeh simulation, both of which are error-prone and
1. The paper explores an interesting and relatively unexplored direction: adding fine-grained photographic control to video editing.
2. It includes a reasonable data collection pipeline and a simple module design that yields some improvements over baselines.
3. The overall writing and presentation are clear.
1. Overstated novelty and incomplete related work discussion. The paper emphasizes its novelty but omits several closely related works, especially in the image domain (e.g., arXiv:2412.02168), which already demonstrate strong results on similar photographic controls. The paper briefly dismisses these methods as “text-conditioned,” but this difference seems superficial, since both textual and numerical conditions are ultimately embedded as vectors. While I acknowledge the novelty on the video si
1. This is the first unified framework for video photography effect editing, extending the idea of Generative Photography (Yuan et al. CVPR 2025) to the video domain. It enables joint control over multiple camera parameters: bokeh, focal length, exposure, and color temperature.
2. The proposed approach has strong potential in video editing, generative AI for photography, and visual effects applications.
1. The disentanglement analysis is not deep enough. In Eq. (5), features from camera intrinsics and extrinsics are directly added; how can this design theoretically achieve disentanglement?
2. There is no clear design for disentangling multiple intrinsics (e.g., focal length vs. exposure). How can these parameters be independently controlled without interference? The paper and supplement lack examples showing the same source video with multiple intrinsics changed simultaneously.
3. The paper
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Visual Attention and Saliency Detection
