Native-Resolution Image Synthesis
Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, Yiyuan Zhang

TL;DR
This paper presents a new generative model, NiT, capable of synthesizing high-quality images at arbitrary resolutions and aspect ratios, surpassing fixed-resolution methods and demonstrating strong zero-shot generalization.
Contribution
Introduction of the Native-resolution diffusion Transformer (NiT), a novel architecture that models variable resolutions and aspect ratios within a single framework.
Findings
A single NiT model achieves state-of-the-art results on both the ImageNet 256x256 and 512x512 benchmarks.
Generates high-fidelity images at unseen high resolutions.
Demonstrates excellent zero-shot generalization to new resolutions and aspect ratios.
Abstract
We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution, square-image methods by natively handling variable-length visual tokens, a core challenge for traditional techniques. To this end, we introduce the Native-resolution diffusion Transformer (NiT), an architecture designed to explicitly model varying resolutions and aspect ratios within its denoising process. Free from the constraints of fixed formats, NiT learns intrinsic visual distributions from images spanning a broad range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves state-of-the-art performance on both the ImageNet-256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen…
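To make "variable-length visual tokens" concrete, here is a minimal sketch of how images of different sizes yield token sequences of different lengths, and how several such sequences could be concatenated into one packed sequence. This is an illustration only, not the authors' implementation: the abstract does not state the patch size, whether patches are taken in pixel or latent space, or how NiT batches sequences. The patch size of 64 and the helper names `token_grid`, `pack_sequences`, and `same_image_mask` are all assumptions made for this example.

```python
from typing import List, Tuple


def token_grid(height: int, width: int, patch: int = 64) -> Tuple[int, int]:
    """Return (rows, cols) of the patch grid for one image.

    The patch size is a hypothetical value chosen for illustration.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return height // patch, width // patch


def pack_sequences(
    sizes: List[Tuple[int, int]], patch: int = 64
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Concatenate the token sequences of several images into one sequence.

    Returns, per token, the index of the image it came from and its
    (row, col) position inside that image's patch grid.
    """
    image_ids: List[int] = []
    positions: List[Tuple[int, int]] = []
    for idx, (h, w) in enumerate(sizes):
        rows, cols = token_grid(h, w, patch)
        for r in range(rows):
            for c in range(cols):
                image_ids.append(idx)
                positions.append((r, c))
    return image_ids, positions


def same_image_mask(image_ids: List[int]) -> List[List[bool]]:
    """Boolean mask: token i may attend to token j only within the same image."""
    return [[a == b for b in image_ids] for a in image_ids]


# Three images with different resolutions and aspect ratios (height, width).
sizes = [(256, 256), (512, 256), (128, 384)]
image_ids, positions = pack_sequences(sizes)
mask = same_image_mask(image_ids)

# Per-image token counts are 4*4, 8*4, and 2*6, so the packed length is 60.
print(len(image_ids))
```

Under these assumptions, no image is resized or cropped to a common square; each keeps its own grid shape, and the mask keeps tokens from different images from attending to one another in the packed sequence.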
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
