Universal Approximation of Visual Autoregressive Transformers
Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song

TL;DR
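This paper proves that simple Visual Autoregressive (VAR) transformers are universal approximators for Lipschitz image-to-image functions. The result gives theoretical backing to VAR models, which the authors describe as outperforming previous image synthesis methods, and is intended to guide the design of efficient image generation models.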
Contribution
It establishes the universality of single-head VAR transformers with a single self-attention layer and a single interpolation layer, providing theoretical foundations and design principles for advanced image synthesis models.
Findings
VAR transformers are described as outperforming previous image synthesis methods, including Diffusion Transformers
Single-head VAR transformers with one self-attention layer and one interpolation layer are universal approximators for Lipschitz image-to-image functions
Flow-based autoregressive transformers share similar capabilities
Abstract
We investigate the fundamental limits of transformer-based foundation models, extending our analysis to include Visual Autoregressive (VAR) transformers. VAR represents a big step toward generating images using a novel, scalable, coarse-to-fine "next-scale prediction" framework. These models set a new quality bar, outperforming all previous methods, including Diffusion Transformers, while having state-of-the-art performance for image synthesis tasks. Our primary contributions establish that, for single-head VAR transformers with a single self-attention layer and single interpolation layer, the VAR Transformer is universal. From the statistical perspective, we prove that such simple VAR transformers are universal approximators for any image-to-image Lipschitz functions. Furthermore, we demonstrate that flow-based autoregressive transformers inherit similar approximation capabilities.…
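Illustrative sketch
The abstract names two building blocks: a single-head self-attention layer and an interpolation (up-sampling) layer, applied coarse to fine. The NumPy sketch below shows one way such a loop could look. It is a toy illustration under stated assumptions, not the paper's construction: the function names, the nearest-neighbor interpolation, the residual connection, and the weights shared across scales are all choices made here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, W_q, W_k, W_v):
    """One single-head self-attention layer with a residual connection.
    X: (n, d) token matrix; W_q, W_k, W_v: (d, d) weights (hypothetical)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(X.shape[1]), axis=-1)  # (n, n) attention weights
    return X + A @ V

def upsample_nearest(F, new_h, new_w):
    """Nearest-neighbor interpolation of a feature map.
    F: (h, w, d) -> (new_h, new_w, d). Assumed here; other interpolation
    schemes (e.g. bicubic) would fill the same role."""
    h, w, _ = F.shape
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return F[rows][:, cols]

def coarse_to_fine(F0, scales, W_q, W_k, W_v):
    """Toy next-scale loop: up-sample to each scale, then apply attention."""
    F = F0
    for s in scales:
        F = upsample_nearest(F, s, s)
        d = F.shape[-1]
        tokens = F.reshape(s * s, d)
        F = single_head_attention(tokens, W_q, W_k, W_v).reshape(s, s, d)
    return F

rng = np.random.default_rng(0)
d = 4
W_q, W_k, W_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
F0 = rng.normal(size=(1, 1, d))          # coarsest 1x1 token map
out = coarse_to_fine(F0, [2, 4, 8], W_q, W_k, W_v)
print(out.shape)
```

Starting from a 1x1 map and passing scales 2, 4, 8 yields an 8x8 feature map with the same channel width. Actual VAR models also quantize tokens and condition each scale on all earlier scales; both are omitted here.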
Taxonomy
Topics: Visual perception and processing mechanisms · Color Science and Applications
