Next Visual Granularity Generation
Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy

TL;DR
This paper introduces the Next Visual Granularity (NVG) framework for image generation, which builds images through a hierarchical, layered sequence that refines from global layout to details, enabling fine control and improved quality.
Contribution
The paper presents the NVG framework, which generates images as a structured sequence capturing multiple levels of visual granularity, and reports better FID scores than the VAR series on class-conditional ImageNet generation.
Findings
NVG outperforms VAR in FID scores across multiple scales.
NVG enables hierarchical, layered image generation with fine control.
A series of NVG models trained on ImageNet shows clear scaling behavior.
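For reference, the FID mentioned in the findings is the Fréchet Inception Distance. This summary does not restate it, so the standard definition is given here. The symbols are conventional notation rather than the paper's own: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception features for real and generated images. Lower is better.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```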
Abstract
We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different levels of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. Compared to the VAR series, NVG consistently…
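The decomposition described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes a greedy merge of the closest token clusters (the reviews mention clustering on token embedding similarity only at a high level), a squared-Euclidean distance, size-weighted centroids, and a hand-picked level schedule. The function name `granularity_sequence`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np


def granularity_sequence(embeddings, level_sizes):
    """Greedily merge token clusters and record one label map per level.

    embeddings: (H, W, D) array of token embeddings (toy stand-in).
    level_sizes: cluster counts at which a map is recorded (assumed schedule).
    Returns a list of (H, W) integer maps ordered coarsest first.
    """
    h, w, d = embeddings.shape
    flat = embeddings.reshape(-1, d).astype(float)
    # Start with one cluster per distinct embedding vector.
    centroids, labels = np.unique(flat, axis=0, return_inverse=True)
    labels = labels.reshape(-1)
    counts = np.bincount(labels, minlength=len(centroids)).astype(float)
    active = list(range(len(centroids)))

    maps = []
    for target in sorted(set(level_sizes), reverse=True):
        while len(active) > target:
            # Find the closest pair of active centroids (assumed similarity rule).
            best = None
            for i in range(len(active)):
                for j in range(i + 1, len(active)):
                    a, b = active[i], active[j]
                    dist = float(np.sum((centroids[a] - centroids[b]) ** 2))
                    if best is None or dist < best[0]:
                        best = (dist, a, b)
            _, a, b = best
            # Merge cluster b into cluster a using a size-weighted centroid.
            total = counts[a] + counts[b]
            centroids[a] = (counts[a] * centroids[a] + counts[b] * centroids[b]) / total
            counts[a] = total
            labels[labels == b] = a
            active.remove(b)
        # Relabel surviving clusters to 0..k-1; the spatial shape never changes.
        _, compact = np.unique(labels, return_inverse=True)
        maps.append(compact.reshape(h, w))
    return maps[::-1]


# Toy 4x4 token map with four distinct 1-D embeddings, one per quadrant.
toy = np.zeros((4, 4, 1))
toy[:2, 2:, 0] = 0.1
toy[2:, :2, 0] = 1.0
toy[2:, 2:, 0] = 5.0

maps = granularity_sequence(toy, level_sizes=[1, 2, 4])
for m in maps:
    print(m.shape, len(np.unique(m)))
```

Every map in the returned sequence keeps the 4x4 resolution while the number of unique labels grows from 1 to 2 to 4. This mirrors the abstract's description of elements that share a spatial resolution but differ in the number of unique tokens.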
Peer Reviews
Decision: ICLR 2026 Poster
This work explores structural information for autoregressive image generation. It generates a structural mask with a clustering-based algorithm, without relying on any pretrained method. The experimental results verify its effectiveness.
- The structured conditional guidance appears to have limited novelty, as similar ideas have been widely explored in text-to-image generation, such as in ControlNet [A].
- There is no ablation study evaluating the effectiveness of the structure and content guidance. In addition, qualitative comparisons with state-of-the-art methods are missing, as are qualitative results for the ablation studies.
[A] Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. "Adding conditional control to text-to-image diff
* The reviewer likes the general idea of generating images from coarse-to-fine structure and content.
* The model shows its scalability, indicating that it is a potential candidate for foundational generative models.
* By leveraging the structure of a reference image, the proposed method can perform generation guided by that reference image.
My major concerns relate to some details in the method and figures; the current manuscript left the reviewer confused.
* From Fig. 2, it seems that the models (structure and content) predict residual results (which are aggregated into the final result). However, in Fig. 4, the outputs of the structure/content models are the final structure/canvas (and ln 269 also mentions that the content generator directly generates the final canvas). And the “Residual” is the difference between the fina
1. The proposed structure prediction follows an intuitive and interpretable coarse-to-fine process, aligning well with how humans perceive visual composition.
2. The framework effectively separates structure and content generation, enabling controllable and interpretable image synthesis.
3. The model demonstrates strong scalability and competitive performance across metrics such as FID and Inception Score compared to state-of-the-art baselines.
1. Semantic Misalignment in Clustering: NVG constructs hierarchical structures via greedy clustering based on token embedding similarity. However, embedding-space similarity does not always reflect semantic consistency; tokens with similar color or texture may be grouped together even if they belong to different objects, leading to inaccurate granularity segmentation.
2. Increased Computational Cost: Compared with VAR, NVG requires two sequential generation steps, structure and content at each sta
Taxonomy
Topics: Glaucoma and retinal disorders
