VA-$\pi$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
Xinyao Liao, Qiyuan He, Kai Xu, Xiaoye Qu, Yicong Li, Wei Wei, Angela Yao

TL;DR
VA-$\pi$ is a post-training framework that optimizes autoregressive image generators directly in pixel space. It combines a variational objective with reinforcement learning to improve image quality and consistency without retraining the tokenizer.
Contribution
It introduces VA-$\pi$, a post-training method that aligns the autoregressive generator with its tokenizer at the pixel level through a variational objective and reinforcement-based policy optimization.
Findings
Reduces FID from 14.36 to 7.65 on ImageNet-1K
Improves IS from 86.55 to 116.70 on ImageNet-1K
Enhances visual and multi-modal generation metrics with minimal data and time
Abstract
Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose VA-$\pi$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$\pi$ formulates the generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-$\pi$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy, uses pixel-space reconstruction…
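For orientation, the two ingredients named in the abstract can be written in their generic textbook forms. This is only a sketch under assumed notation, not the paper's exact bound: $x$ is an image, $z$ a discrete token sequence, $q(z \mid x)$ the tokenizer's encoding distribution, $p(x \mid z)$ the tokenizer's decoder, $\pi_\theta(z)$ the AR generator viewed as a policy, and $R(z)$ a placeholder reward whose actual definition is not given in the text above.

$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{z \sim q(z \mid x)}\big[\log p(x \mid z)\big] \;-\; \mathrm{KL}\big(q(z \mid x)\,\|\,\pi_\theta(z)\big)
$$

$$
\nabla_\theta\, \mathbb{E}_{z \sim \pi_\theta}\big[R(z)\big] \;=\; \mathbb{E}_{z \sim \pi_\theta}\big[R(z)\, \nabla_\theta \log \pi_\theta(z)\big]
$$

The first line is the standard ELBO: a pixel reconstruction term plus a term that ties the token distribution to the autoregressive model. The second is the standard score-function (REINFORCE) identity, which allows a gradient to pass through sampling of discrete tokens, where direct backpropagation is unavailable.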
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
