Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
Joonhyung Park, Hyeongwon Jang, Joowon Kim, Eunho Yang

TL;DR
This paper introduces GridAR, a test-time scaling framework for autoregressive image generation that improves quality and efficiency through progressive, grid-based candidate generation and prompt reformulation.
Contribution
GridAR is a new test-time scaling method that enhances visual autoregressive models through grid-partitioned progressive generation and prompt reformulation strategies.
Findings
Outperforms Best-of-N while using less computation.
Achieves 14.4% higher quality on T2I-CompBench++ with N=4.
Shows 13.9% improvement in semantic preservation in image editing.
Abstract
Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation…
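The cost argument against Best-of-N in the abstract can be made concrete with a toy sketch. This is not GridAR itself: the abstract is truncated before the method is described, so the code below only contrasts plain Best-of-N with a generic "score partial candidates, prune, then finish the survivors" scheme. All names (`best_of_n`, `prune_then_continue`, `toy_generate`, `toy_extend`, `toy_score`) are hypothetical, the "tokens" are random numbers, the scorer is a stand-in for a real verifier, and cost is counted simply as the number of decoded tokens.

```python
import random

def best_of_n(generate, score, n, length):
    """Best-of-N: decode n full-length candidates and keep the highest-scoring one."""
    candidates = [generate(length) for _ in range(n)]
    cost = n * length  # every candidate pays for full-length decoding
    return max(candidates, key=score), cost

def prune_then_continue(generate, extend, score, n, keep, partial_len, length):
    """Illustrative alternative (an assumption, not the paper's algorithm):
    decode n partial candidates, keep the top `keep`, and finish only those."""
    partials = [generate(partial_len) for _ in range(n)]
    survivors = sorted(partials, key=score, reverse=True)[:keep]
    finished = [extend(p, length) for p in survivors]
    # n short prefixes, plus the remaining tokens for the survivors only
    cost = n * partial_len + keep * (length - partial_len)
    return max(finished, key=score), cost

# Toy stand-ins: a "token" is a random float and the score is the sum of tokens.
rng = random.Random(0)

def toy_generate(length):
    return [rng.random() for _ in range(length)]

def toy_extend(prefix, length):
    return prefix + [rng.random() for _ in range(length - len(prefix))]

def toy_score(tokens):
    return sum(tokens)

best_bon, bon_cost = best_of_n(toy_generate, toy_score, n=4, length=16)
best_pruned, pruned_cost = prune_then_continue(
    toy_generate, toy_extend, toy_score, n=4, keep=2, partial_len=4, length=16
)
print(bon_cost, pruned_cost)
```

With these toy settings, Best-of-N decodes 4 × 16 = 64 tokens, while the pruned variant decodes 4 × 4 + 2 × 12 = 40. The numbers illustrate only the accounting, not any result reported in the paper.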
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
