Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model

Yang Yang; Siming Zheng; Qirui Yang; Jinwei Chen; Boxi Wu; Xiaofei He; Deng Cai; Bo Li; Peng-Tao Jiang

arXiv:2505.21593·cs.CV·October 13, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a novel diffusion-based method for generating temporally coherent, depth-aware video bokeh effects, addressing previous issues of flickering and lack of control, and establishing a new benchmark in controllable video depth-of-field rendering.

Contribution

It presents the first dedicated diffusion framework for video bokeh, utilizing multi-plane image conditioning and progressive training for improved stability and controllability.

Findings

01 Outperforms prior methods in temporal coherence and spatial accuracy.

02 Demonstrates effectiveness on synthetic and real-world benchmarks.

03 Provides controllable depth-of-field effects with enhanced stability.

Abstract

Diffusion models have recently emerged as powerful tools for camera simulation, enabling both geometric transformations and realistic optical effects. Among these, image-based bokeh rendering has shown promising results, but diffusion for video bokeh remains unexplored. Existing image-based methods are plagued by temporal flickering and inconsistent blur transitions, while current video editing methods lack explicit control over the focus plane and bokeh intensity. These issues limit their applicability for controllable video bokeh. In this work, we propose a one-step diffusion framework for generating temporally coherent, depth-aware video bokeh rendering. The framework employs a multi-plane image (MPI) representation adapted to the focal plane to condition the video diffusion model, thereby enabling it to exploit strong 3D priors from pretrained backbones. To further enhance temporal…
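To make the idea of an MPI representation "adapted to the focal plane" more concrete, here is a minimal sketch, not the authors' implementation. It assumes a per-pixel disparity map and a chosen focus disparity, and it bins pixels into layers by their absolute distance from the focal plane. The function name `focal_plane_mpi_masks`, the uniform bin spacing, and the use of boolean masks are all illustrative assumptions; the abstract does not specify how the layers are built.

```python
import numpy as np


def focal_plane_mpi_masks(disparity, focus_disparity, num_planes=4):
    """Split a disparity map into layers ordered by distance from the focal plane.

    Hypothetical helper for illustration only. Returns a boolean array of
    shape (num_planes, H, W); layer 0 holds the pixels closest to the
    focal plane, the last layer holds the pixels farthest from it.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    # Distance of every pixel from the focal plane, in disparity units.
    dist = np.abs(disparity - focus_disparity)
    max_dist = dist.max()

    if max_dist == 0:
        # Every pixel lies on the focal plane: put everything in layer 0.
        masks = np.zeros((num_planes,) + disparity.shape, dtype=bool)
        masks[0] = True
        return masks

    # Assumption: uniformly spaced layer boundaries between 0 and max_dist.
    edges = np.linspace(0.0, max_dist, num_planes + 1)
    # Inner edges only, so that indices fall in 0 .. num_planes - 1.
    layer_index = np.digitize(dist, edges[1:-1])
    return np.stack([layer_index == k for k in range(num_planes)])


if __name__ == "__main__":
    disp = np.array([[0.0, 0.5], [1.0, 0.5]])
    masks = focal_plane_mpi_masks(disp, focus_disparity=0.5, num_planes=2)
    print(masks.astype(int))
```

In this sketch every pixel belongs to exactly one layer, so the masks could be multiplied with a frame to produce per-layer images for conditioning. Whether the actual method uses hard masks, soft weights, or a different spacing is not stated in the excerpt above.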

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. **Innovative Approach:** The use of an MPI-guided conditioning mechanism in a one-step diffusion framework for video bokeh generation is both original and well-motivated. It effectively bridges the gap between static image refocusing and temporally coherent video refocusing.
2. **Temporal Coherence and Depth Awareness:** The focal-plane-adapted MPI representation efficiently balances detail preservation in focused regions and smooth transitions in defocused areas, improving visual cons

Weaknesses

1. **Dependence on Depth Estimation:** The method relies on pre-trained depth estimation models as input. In dynamic or complex scenes, depth errors may propagate into the final bokeh rendering. The paper would be stronger with a sensitivity analysis or ablation showing how depth inaccuracies affect output quality.
2. **Computational Efficiency:** While the results are impressive, the paper provides limited discussion on computational cost. Diffusion-based models are typically resource-in

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. The authors leverage the prior of a video diffusion model in a novel way, to perform temporally consistent, controllable video bokeh.
2. The authors showcase good bokeh results. They also provide a supplementary video with their results and comparisons to other methods, which is very important for the qualitative assessment of their claims for temporally consistent bokeh addition.
3. The authors show extensive quantitative evaluations, emphasizing their lead over other competing methods.
4.

Weaknesses

Major:
1. The authors do not discuss the limitations of their method. Are there any scenarios where the model fails to generate a good bokeh video? Perhaps videos with fast motion, such as a car race.

Minor:
1. Figure 1 is not referenced in the text.
2. SM figures 10 and 11: the red border does not correspond to the zoomed-in area.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The authors propose a one-step diffusion framework for video bokeh rendering, which offers an efficiency advantage at inference time.
- In the third stage, the authors fine-tune the VAE decoder and introduce a texture loss based on image gradients to improve high-frequency texture and edge clarity, which helps preserve the details of the focused subject.
- Temporal consistency and video quality metrics are significantly better than those of the baselines reported in the paper.
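The gradient-based texture loss mentioned above can be illustrated with a short sketch. This is an assumption-laden example, not the paper's loss: it uses forward finite differences and an L1 penalty on the gradient mismatch, and the names `image_gradients` and `gradient_texture_loss` are hypothetical. The review only states that the loss is "based on image gradients".

```python
import numpy as np


def image_gradients(img):
    """Forward finite differences along width (dx) and height (dy)."""
    img = np.asarray(img, dtype=np.float64)
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    return dx, dy


def gradient_texture_loss(pred, target):
    """Mean absolute difference between the gradients of two images.

    Assumption: an L1 penalty on finite-difference gradients. The loss
    ignores constant brightness offsets and responds only to differences
    in edges and texture.
    """
    pred_dx, pred_dy = image_gradients(pred)
    tgt_dx, tgt_dy = image_gradients(target)
    return float(np.abs(pred_dx - tgt_dx).mean() + np.abs(pred_dy - tgt_dy).mean())


if __name__ == "__main__":
    target = np.zeros((2, 2))
    pred = np.array([[0.0, 1.0], [0.0, 1.0]])  # contains a vertical edge
    print(gradient_texture_loss(pred, target))
```

Because the loss compares gradients rather than pixel values, a prediction that differs from the target only by a uniform brightness shift incurs zero penalty, while a blurred or missing edge is penalized. That property is why a term of this kind is commonly paired with a pixel-level or perceptual loss.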

Weaknesses

- Comparison with video bokeh methods. The authors compare their method only with image bokeh methods; it also needs to be compared with video methods, such as VBR [1].
- Complex scenes. The authors employ a multi-plane image (MPI) representation, which can introduce challenges, such as whether a single object should be split across two different layers. The authors should discuss this situation.
- The robustness of this method. The authors should compare their full model with degraded depth maps an

Taxonomy

Topics: Advanced Image Processing Techniques · Digital Media Forensic Detection · Video Analysis and Summarization