ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

Sungho Koh; SeungJu Cha; Hyunwoo Oh; Kwanyoung Lee; Dong-Jin Kim

arXiv:2510.25818 · cs.LG · October 31, 2025

TL;DR

ScaleDiff is a training-free, model-agnostic framework that efficiently extends the resolution of pretrained diffusion models. It combines Neighborhood Patch Attention, Latent Frequency Mixing, and Structure Guidance to improve image quality and fine detail.

Contribution

The paper introduces ScaleDiff, a model-agnostic, training-free method for synthesizing images beyond a pretrained diffusion model's training resolution. It is built on a new attention mechanism (Neighborhood Patch Attention), Latent Frequency Mixing, and Structure Guidance.

Findings

01. Achieves state-of-the-art results among training-free methods.

02. Improves image quality and inference speed.

03. Effective across U-Net and Diffusion Transformer architectures.

Abstract

Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results…
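
The abstract attributes part of the efficiency gain to using non-overlapping patches. As a rough illustration of why overlap is costly in general, the sketch below counts how many window positions a sliding-window scheme visits with and without overlap. It is not NPA itself: the helper names (`count_positions`, `total_patches`) and the sizes (a 256×256 latent, 128×128 patches) are assumptions chosen for the example, not values from the paper.

```python
def count_positions(size: int, patch: int, stride: int) -> int:
    """Window positions along one axis, rounding up so the border is covered."""
    if patch <= 0 or stride <= 0:
        raise ValueError("patch and stride must be positive")
    if size < patch:
        raise ValueError("size must be at least as large as patch")
    # Ceiling division of the remaining span by the stride, plus the first window.
    return -(-(size - patch) // stride) + 1


def total_patches(height: int, width: int, patch: int, stride: int) -> int:
    """Total number of square patches visited on a height x width latent."""
    return count_positions(height, patch, stride) * count_positions(width, patch, stride)


# Hypothetical sizes: 256x256 latent, 128x128 patches.
overlapping = total_patches(256, 256, patch=128, stride=64)       # half-patch overlap -> 9
non_overlapping = total_patches(256, 256, patch=128, stride=128)  # no overlap -> 4

# Every patch holds the same number of tokens, so the ratio of patch counts
# is also the ratio of tokens processed per layer.
redundancy = overlapping / non_overlapping  # 2.25
```

Under these assumed sizes, half-patch overlap processes 2.25 times as many tokens as a non-overlapping partition of the same latent. How NPA keeps patch boundaries coherent without that overlap is described in the paper, not in this sketch.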

Videos

ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion · SlidesLive