Visual Implicit Autoregressive Modeling

Pengfei Jiang; Jixiang Luo; Luxi Lin; Zhaohong Huang; Xuelong Li

arXiv:2605.01220·cs.CV·May 5, 2026

TL;DR

VIAR places an implicit equilibrium layer inside a next-scale visual autoregressive model, so that compute and memory can be adjusted at inference while image quality on ImageNet 256x256 matches or surpasses strong AR baselines.

Contribution

The paper proposes VIAR, an implicit autoregressive model trained with Jacobian-Free Backpropagation. Training memory stays constant, and with 38.4% of VAR's parameters the model matches or surpasses strong AR baselines while remaining competitive with large diffusion models.
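To make the idea of Jacobian-Free Backpropagation concrete, here is a minimal sketch on a toy scalar layer. It is not the paper's architecture: the layer `f`, the squared-error loss, and the names `solve_fixed_point`, `jfb_grad_w`, and `exact_grad_w` are all assumptions chosen for illustration. The forward pass iterates to a fixed point without storing intermediates; the JFB gradient differentiates only one final application of the layer, dropping the inverse-Jacobian factor that the exact implicit gradient requires.

```python
import math

def f(z, x, w):
    """One application of a toy scalar layer; a contraction in z when |w| < 1."""
    return math.tanh(w * z + x)

def solve_fixed_point(x, w, max_iters=100, tol=1e-10):
    """Forward pass: iterate z <- f(z, x, w), keeping only the current state."""
    z = 0.0
    for _ in range(max_iters):
        z_next = f(z, x, w)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z

def jfb_grad_w(x, w, y):
    """JFB gradient of 0.5 * (z* - y)^2 w.r.t. w: one last step, z held constant."""
    z_star = solve_fixed_point(x, w)
    dtanh = 1.0 - math.tanh(w * z_star + x) ** 2  # tanh' at the fixed point
    df_dw = dtanh * z_star                        # partial f / partial w
    return (z_star - y) * df_dw

def exact_grad_w(x, w, y):
    """Exact implicit gradient, which also needs the factor 1 / (1 - df/dz)."""
    z_star = solve_fixed_point(x, w)
    dtanh = 1.0 - math.tanh(w * z_star + x) ** 2
    df_dw = dtanh * z_star
    df_dz = dtanh * w
    return (z_star - y) * df_dw / (1.0 - df_dz)

# Hypothetical values, chosen only so the layer is a contraction.
g_jfb = jfb_grad_w(x=0.5, w=0.5, y=0.0)
g_exact = exact_grad_w(x=0.5, w=0.5, y=0.0)
```

In this toy case the two gradients share a sign and differ only by a positive scale factor, so the JFB direction is still a descent direction. The cost of the backward step does not depend on how many forward iterations were run, which is the intuition behind the constant training memory claimed above.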

Findings

1. VIAR achieves FID 2.16 on ImageNet 256x256.

2. VIAR reduces peak memory from 19.24 GB to 8.53 GB.

3. VIAR doubles throughput to 32.08 images/sec on a single RTX 4090.
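The relative size of these changes can be checked with a few lines of arithmetic. The 15.16 images/sec baseline is taken from the abstract; pairing it with the 32.08 figure is an assumption, since the abstract sentence is truncated on this page.

```python
# Figures quoted on this page; the pairing of the two throughput values is assumed.
mem_before_gb, mem_after_gb = 19.24, 8.53
throughput_before, throughput_after = 15.16, 32.08

# Fraction of peak memory saved, as a percentage.
mem_reduction_pct = 100 * (1 - mem_after_gb / mem_before_gb)

# Throughput ratio; "doubles" corresponds to a value near 2.
speedup = throughput_after / throughput_before

print(f"peak memory reduction: {mem_reduction_pct:.1f}%")
print(f"throughput speedup: {speedup:.2f}x")
```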

Abstract

Visual Autoregressive Modeling (VAR), based on next-scale prediction, achieves strong generation quality, but its explicit deep stack fixes the amount of computation per scale and inflates memory at high resolutions. We introduce Visual Implicit Autoregressive Modeling (VIAR), a next-scale autoregressive generator that embeds an implicit equilibrium layer between shallow pre/post blocks. The implicit layer is trained with Jacobian-Free Backpropagation, yielding constant training memory, while inference exposes a per-scale iteration knob that enables compute control. On the ImageNet 256x256 benchmark, VIAR attains an FID of 2.16 and an sFID of 8.07 with only 38.4% of VAR's parameters, matching or surpassing strong AR baselines and remaining competitive with large diffusion models. By controlling the per-scale knob, VIAR can reduce peak memory from 19.24 GB to 8.53 GB and double throughput from 15.16 to…
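The per-scale iteration knob can be pictured with a small sketch. Everything here is a hypothetical stand-in: `implicit_block` is a toy elementwise contraction rather than the paper's layer, `iters_per_scale` is an invented parameter name, and the conditioning of each scale on earlier scales is omitted for brevity. The point is only that one weight-tied block is applied a chosen number of times per scale, and that a larger budget brings the state closer to equilibrium at proportionally higher compute.

```python
import math

def implicit_block(z, x, w=0.5):
    """Toy stand-in for the implicit layer: an elementwise contraction in z."""
    return [math.tanh(w * zi + xi) for zi, xi in zip(z, x)]

def run_scale(x, num_iters):
    """Apply the same block num_iters times; return the state and last residual."""
    z = [0.0] * len(x)
    residual = float("inf")
    for _ in range(num_iters):
        z_next = implicit_block(z, x)
        residual = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
    return z, residual

def generate(scale_inputs, iters_per_scale):
    """Process scales coarse to fine, each with its own iteration budget."""
    residuals = []
    for x, k in zip(scale_inputs, iters_per_scale):
        _, r = run_scale(x, k)
        residuals.append(r)
    return residuals

# Hypothetical inputs: three "scales" with 1, 4, and 16 tokens.
scale_inputs = [[0.1 * (i + 1) for i in range(n)] for n in (1, 4, 16)]

cheap = generate(scale_inputs, iters_per_scale=[2, 2, 2])      # low compute
costly = generate(scale_inputs, iters_per_scale=[12, 12, 12])  # high compute
```

Comparing `cheap` and `costly` shows the residual at every scale shrinking as the budget grows. This sketch does not model memory usage or image quality; it only illustrates how a single integer per scale can trade compute for convergence.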
