Learning to Expand Images for Efficient Visual Autoregressive Modeling
Ruiqing Yang, Kaixin Zhang, Zheng Zhang, Shan You, Tao Huang

TL;DR
This paper introduces EAR, a biologically inspired image generation method that expands tokens from the center outward, enabling efficient parallel decoding and improved quality in autoregressive visual models.
Contribution
We propose EAR, a novel spiral expansion approach with adaptive decoding, improving efficiency and quality in autoregressive image generation.
Findings
Achieves state-of-the-art fidelity-efficiency trade-offs on ImageNet
Reduces computational cost compared to traditional token-by-token methods
Aligns generation order with perceptual relevance for better quality
Abstract
Autoregressive models have recently shown great promise in visual generation by leveraging discrete token sequences akin to language modeling. However, existing approaches often suffer from inefficiency, either due to token-by-token decoding or the complexity of multi-scale representations. In this work, we introduce Expanding Autoregressive Representation (EAR), a novel generation paradigm that emulates the human visual system's center-outward perception pattern. EAR unfolds image tokens in a spiral order from the center and progressively expands outward, preserving spatial continuity and enabling efficient parallel decoding. To further enhance flexibility and speed, we propose a length-adaptive decoding strategy that dynamically adjusts the number of tokens predicted at each step. This biologically inspired design not only reduces computational cost but also improves generation…
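As a rough illustration of the center-outward ordering described in the abstract, the following minimal Python sketch enumerates the cells of a token grid in a spiral that starts at the center and walks outward, then splits that order into decoding steps of varying length. This is not the authors' implementation: the function names `spiral_order` and `group_steps`, the clockwise direction, the choice of center cell for even-sized grids, and the caller-supplied step lengths are all assumptions made for illustration. The abstract does not specify how EAR picks the number of tokens per step.

```python
def spiral_order(height, width):
    """Return every (row, col) of a height x width grid, center first,
    following a clockwise spiral walk outward (illustrative assumption)."""
    # For even sizes there is no single center cell; pick the upper-left one.
    r, c = (height - 1) // 2, (width - 1) // 2
    order = [(r, c)]
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up
    step, d = 1, 0
    total = height * width
    while len(order) < total:
        # A square spiral uses each run length twice: 1, 1, 2, 2, 3, 3, ...
        for _ in range(2):
            dr, dc = directions[d % 4]
            for _ in range(step):
                r, c = r + dr, c + dc
                # Positions outside the grid are walked over but not emitted.
                if 0 <= r < height and 0 <= c < width:
                    order.append((r, c))
            d += 1
        step += 1
    return order


def group_steps(order, lengths):
    """Split a token order into consecutive decoding steps.

    `lengths` is a hypothetical per-step token count supplied by the caller;
    it stands in for whatever schedule a length-adaptive decoder would use."""
    if sum(lengths) != len(order):
        raise ValueError("step lengths must cover the whole sequence")
    steps, start = [], 0
    for n in lengths:
        steps.append(order[start:start + n])
        start += n
    return steps


if __name__ == "__main__":
    order = spiral_order(3, 3)
    for i, step in enumerate(group_steps(order, [1, 2, 6])):
        print(f"step {i}: {step}")
```

On a square grid, consecutive positions in this order are always grid neighbors, which is one simple way to read the "spatial continuity" property mentioned above. Tokens within a step would be predicted in parallel, so fewer, longer steps mean fewer forward passes.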
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
