Native-Resolution Image Synthesis

Zidong Wang; Lei Bai; Xiangyu Yue; Wanli Ouyang; Yiyuan Zhang

arXiv:2506.03131·cs.CV·June 4, 2025

TL;DR

This paper presents a new generative model, NiT, capable of synthesizing high-quality images at arbitrary resolutions and aspect ratios, surpassing fixed-resolution methods and demonstrating strong zero-shot generalization.

Contribution

Introduction of the Native-resolution diffusion Transformer (NiT), a novel architecture that models variable resolutions and aspect ratios within a single framework.

Findings

01

A single NiT model achieves state-of-the-art results on both the ImageNet 256x256 and 512x512 benchmarks.

02

Generates high-fidelity images at unseen high resolutions.

03

Demonstrates excellent zero-shot generalization to new resolutions and aspect ratios.

Abstract

We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution, square-image methods by natively handling variable-length visual tokens, a core challenge for traditional techniques. To this end, we introduce the Native-resolution diffusion Transformer (NiT), an architecture designed to explicitly model varying resolutions and aspect ratios within its denoising process. Free from the constraints of fixed formats, NiT learns intrinsic visual distributions from images spanning a broad range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves state-of-the-art performance on both the ImageNet 256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen…
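To make "variable-length visual tokens" concrete, here is a minimal sketch of how images of different sizes yield different token counts and can be laid end to end in one sequence. It is an illustration only, not the NiT implementation: the patch size of 16 and the helper names `token_grid` and `pack` are assumptions introduced here, and any latent-space downsampling a real model might apply is ignored.

```python
from itertools import accumulate

PATCH = 16  # hypothetical patch size, not taken from the paper


def token_grid(height, width, patch=PATCH):
    """Return (rows, cols) of the patch grid for one image."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return height // patch, width // patch


def pack(sizes, patch=PATCH):
    """Pack images of different sizes into one flat token sequence.

    Returns the per-image token counts and the cumulative boundaries,
    so that tokens[bounds[i]:bounds[i + 1]] belong to image i.
    """
    counts = [r * c for r, c in (token_grid(h, w, patch) for h, w in sizes)]
    bounds = [0] + list(accumulate(counts))
    return counts, bounds


# Two square images and one non-square image share a single sequence.
counts, bounds = pack([(256, 256), (512, 512), (256, 384)])
print(counts)  # [256, 1024, 384]
print(bounds)  # [0, 256, 1280, 1664]
```

A fixed-resolution model would instead resize or crop every image to one grid, so each sample would contribute the same number of tokens. Keeping the boundaries lets each image retain its own resolution and aspect ratio.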

Code & Models

Models

GoodEnough/NiT-XL-Models
model · 4 dl · ♡ 11

Datasets

GoodEnough/NiT-Preprocessed-ImageNet1K
dataset · 142 dl

Videos

Native-Resolution Image Synthesis· slideslive

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques