Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising

Yifan Wang; Liya Ji; Zhanghan Ke; Harry Yang; Ser-Nam Lim; Qifeng Chen

arXiv:2511.14719·cs.CV·November 19, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a zero-shot method for enhancing synthetic video realism by conditioning diffusion models on structure-aware information, achieving high structural and photorealistic quality without fine-tuning.

Contribution

The authors develop a zero-shot framework that enhances synthetic videos by conditioning on estimated structural information, outperforming existing methods in structural consistency and realism.

Findings

01 · Outperforms existing baselines in structural consistency

02 · Maintains state-of-the-art photorealism quality

03 · Operates without further fine-tuning

Abstract

We propose an approach to enhancing synthetic video realism that re-renders synthetic videos from a simulator in a photorealistic fashion. Our approach is a zero-shot framework, built upon a video diffusion foundation model without further fine-tuning, that focuses on preserving the multi-level structures of the synthetic video in the enhanced one, in both the spatial and temporal domains. Specifically, we modify the generation/denoising process so that it is conditioned on structure-aware information, such as depth maps, semantic maps, and edge maps, which is estimated from the synthetic video by an auxiliary model rather than extracted from the simulator. This guidance ensures that the enhanced videos are consistent with the original synthetic video at both the structural and semantic levels. Our approach is a simple yet general and…
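To make the idea of estimating structure from the video itself more concrete, here is a minimal sketch of one such signal: a per-frame edge map computed directly from pixel values. It is an illustration only. The Sobel filter stands in for the auxiliary model, which the abstract does not name, and the function names `sobel_edge_map` and `video_edge_maps` are hypothetical. How the resulting maps would condition the denoising process is not shown.

```python
import numpy as np


def sobel_edge_map(frame: np.ndarray) -> np.ndarray:
    """Return a gradient-magnitude edge map in [0, 1] for one grayscale frame (H, W).

    Assumption: a Sobel filter is used here as a stand-in for the auxiliary
    structure estimator; the paper's actual estimator is not specified.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = frame.shape
    # Replicate border pixels so the output keeps the input resolution.
    padded = np.pad(frame.astype(np.float64), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # 3x3 cross-correlation written as a sum of shifted views.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    mag = np.hypot(gx, gy)
    peak = mag.max()
    # Normalize to [0, 1]; a completely flat frame stays all zeros.
    return mag / peak if peak > 0 else mag


def video_edge_maps(video: np.ndarray) -> np.ndarray:
    """Map a grayscale video (T, H, W) to per-frame edge maps (T, H, W)."""
    return np.stack([sobel_edge_map(frame) for frame in video])


# Toy input: two identical 8x8 frames, dark left half and bright right half.
video = np.zeros((2, 8, 8))
video[:, :, 4:] = 1.0
edges = video_edge_maps(video)
```

In this toy input the only non-zero responses lie along the vertical boundary between the two halves, and they are identical in both frames. That is the kind of spatially and temporally stable cue the abstract describes the enhanced video as being guided by.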

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper tackles a practically relevant problem of improving the realism of synthetic driving videos to reduce the sim-to-real gap, which is important for autonomous driving research. The motivation is clear, particularly the focus on preserving small, safety-critical objects such as traffic lights and signs while enhancing overall visual fidelity.
2. The proposed pipeline is conceptually simple and easy to follow, combining well-known diffusion techniques like DDIM inversion and ControlNet

Weaknesses

1. Novelty is limited / largely engineering of known pieces. DDIM inversion, CFG steering, ControlNet conditioning on depth/seg/edges, and video diffusion backbones are all established. FateZero-style zero-shot editing already uses inversion to preserve structure. The paper does not convincingly argue for a fundamentally new algorithm beyond “we combine these for CARLA videos.”
2. No downstream AV task evaluation. The primary stated motivation is improving autonomous driving models trained on sy

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. Practical Engineering: The paper presents a well-integrated pipeline that combines several state-of-the-art techniques, demonstrating solid engineering and implementation.
2. Clarity: The methodology is clearly described, and the paper is easy to follow for readers familiar with diffusion models and video synthesis.
3. Significance: The task of synthetic video realism enhancement is important for applications in virtual production, simulation, and content creation.

Weaknesses

1. Lack of Novelty: The core components, i.e. CFG, latent inversion, ControlNet, and EDM, are not new, and the paper does not offer significant innovation in how they are applied. The techniques used are well-established and widely applied in similar contexts, and the paper does not introduce novel algorithms or insights beyond their combination.
2. Missing Ablation Study: There is no ablation analysis to isolate the contribution of each component, which makes it difficult to assess the effectiv

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The proposed method introduces a zero-shot, training-free realism enhancement framework that effectively reduces the training cost of the model.
- Experimental results show that the proposed method maintains consistency between the input and output videos for key objects such as traffic signs and traffic lights.

Weaknesses

1. The contributions of this paper are somewhat limited. DDIM inversion has already been adopted by many existing methods, such as AnyV2V [1] and WAVE [2]. The authors should emphasize, from a methodological perspective, the theoretical advantages of the proposed DDIM inversion over other existing inversion schemes. In addition, experimental comparisons between the proposed baseline and other DDIM inversion baselines are needed to demonstrate the superiority a

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Coding and Compression Technologies · Advanced Vision and Imaging