FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion

Hugo Caselles-Dupré (1), Mathis Koroglu (1, 2), Guillaume Jeanneret (2), Arnaud Dapogny (2), Matthieu Cord (2) ((1) Obvious Research, Paris, France, (2) Institute of Intelligent Systems, Robotics - Sorbonne University, Paris, France)

arXiv:2603.17555 · cs.CV · March 19, 2026

TL;DR

FrescoDiffusion is a training-free method for 4K image-to-video generation that combines tiled denoising with a global latent prior, aiming to preserve high-resolution detail while keeping the video spatially and temporally coherent.

Contribution

The paper introduces a training-free approach to large-format I2V generation that fuses tiled denoising with a precomputed global latent reference to improve coherence.

Findings

01 Improved global consistency and fidelity over tiled baselines.

02 Efficient 4K image-to-video generation with fine detail preservation.

03 Enables a controllable trade-off between creativity and consistency.

Abstract

Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model's native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that…
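The abstract is cut off before it states how the global reference enters the tiled denoising, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes a single 2D latent instead of a full video latent, a hypothetical `denoise_tile` callable standing in for the I2V model, plain averaging of overlapping tiles, and a hypothetical weight `lam` that linearly blends the tiled result with the upsampled low-resolution prior. All function names and parameters are illustrative.

```python
import numpy as np

def upsample_nearest(latent, factor):
    """Nearest-neighbour upsampling of an (H, W) latent by an integer factor."""
    return np.repeat(np.repeat(latent, factor, axis=0), factor, axis=1)

def tile_starts(size, tile, stride):
    """Start offsets of windows of length `tile` that together cover [0, size)."""
    assert size >= tile, "latent must be at least one tile wide"
    starts = list(range(0, size - tile + 1, stride))
    if starts[-1] != size - tile:
        starts.append(size - tile)  # extra window so the border is covered
    return starts

def fused_tiled_step(latent, prior, denoise_tile, tile=8, stride=4, lam=0.5):
    """One illustrative step: average overlapping tile outputs, then blend with the prior.

    `lam` is an assumed knob: 0 keeps the pure tiled result, 1 returns the prior.
    """
    acc = np.zeros_like(latent, dtype=float)
    count = np.zeros_like(latent, dtype=float)
    h, w = latent.shape
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            # Each tile is denoised independently at the model's native size.
            acc[y:y + tile, x:x + tile] += denoise_tile(latent[y:y + tile, x:x + tile])
            count[y:y + tile, x:x + tile] += 1.0
    tiled = acc / count  # average where tiles overlap
    return (1.0 - lam) * tiled + lam * prior

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    low_res = rng.normal(size=(4, 4))       # stand-in for the low-resolution latent
    prior = upsample_nearest(low_res, 4)    # global reference at the target size
    z = rng.normal(size=(16, 16))           # stand-in for the high-resolution latent
    out = fused_tiled_step(z, prior, lambda t: 0.9 * t, lam=0.3)
    print(out.shape)
```

In this sketch a larger `lam` pulls the result toward the global reference and a smaller one leaves more freedom to the per-tile outputs, which is one possible reading of the creativity versus consistency trade-off listed in the findings.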

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Advanced Vision and Imaging