TL;DR
ScrollScape transforms ultra-high-resolution image synthesis into a video generation problem, leveraging video priors to maintain structural integrity at 32K resolution with extreme aspect ratios.
Contribution
The paper introduces a framework that reformulates extreme-aspect-ratio (EAR) image synthesis as video generation, using spatial-temporal mapping and super-resolution priors to scale to ultra-high resolutions while preserving quality.
Findings
Outperforms existing baselines by reducing artifacts.
Achieves 32K resolution with global coherence.
Effectively aligns video priors with high-resolution image synthesis.
Abstract
While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity. Specifically, Scanning…
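The space-to-time mapping described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names `canvas_to_frames` and `frames_to_canvas`, the fixed window width, and the fixed stride are all assumptions made for illustration. The sketch only shows how a wide canvas can be viewed as a sequence of overlapping "frames" (as if a camera scrolled across it) and how such frames can be stitched back into one canvas by averaging the overlaps.

```python
import numpy as np


def canvas_to_frames(canvas, window_width, stride):
    """Slice a wide (H, W, C) canvas into overlapping frames.

    Hypothetical helper: returns an array of shape (T, H, window_width, C),
    where frame t covers columns [t * stride, t * stride + window_width).
    """
    height, width, channels = canvas.shape
    if window_width > width:
        raise ValueError("window_width must not exceed the canvas width")
    if (width - window_width) % stride != 0:
        raise ValueError("stride must tile the canvas width exactly")
    starts = range(0, width - window_width + 1, stride)
    return np.stack([canvas[:, s:s + window_width] for s in starts])


def frames_to_canvas(frames, stride):
    """Stitch (T, H, window_width, C) frames back into a single canvas.

    Hypothetical helper: overlapping columns are averaged, which is one
    simple blending choice among many.
    """
    num_frames, height, window_width, channels = frames.shape
    width = (num_frames - 1) * stride + window_width
    total = np.zeros((height, width, channels), dtype=np.float64)
    count = np.zeros((1, width, 1), dtype=np.float64)
    for t in range(num_frames):
        s = t * stride
        total[:, s:s + window_width] += frames[t]
        count[:, s:s + window_width] += 1
    return total / count


# Toy usage: a 4 x 32 "panorama" scanned with an 8-column window, stride 4.
canvas = np.random.default_rng(0).random((4, 32, 3))
frames = canvas_to_frames(canvas, window_width=8, stride=4)
restored = frames_to_canvas(frames, stride=4)
print(frames.shape, np.allclose(restored, canvas))
```

In this toy setting the frame index plays the role of time, so any consistency a video model enforces between neighboring frames would act on neighboring regions of the canvas. How ScrollScape actually generates and merges frames is not specified in the truncated abstract above.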
