VA-$\pi$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation

Xinyao Liao; Qiyuan He; Kai Xu; Xiaoye Qu; Yicong Li; Wei Wei; Angela Yao

arXiv:2512.19680·cs.CV·December 23, 2025

TL;DR

VA-$\pi$ is a post-training framework that optimizes autoregressive image generators directly against a pixel-space objective. It combines a variational formulation with reinforcement learning and improves image quality and consistency without retraining the tokenizer.

Contribution

The paper introduces VA-$\pi$, a post-training method that aligns an autoregressive generator with its tokenizer for pixel-level quality, using a variational objective and reinforcement-based policy optimization.
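For orientation, the textbook evidence lower bound (ELBO) for an image $x$ with a discrete token sequence $z$ as the latent variable is written out below. This is the generic form, not necessarily the exact objective derived in the paper. The symbols $q_\phi$ (tokenizer encoder), $p_\psi$ (tokenizer decoder), and $\pi_\theta$ (AR generator acting as the prior over tokens) are notation assumed here for illustration.

$$
\log p(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\psi(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, \pi_\theta(z)\big),
\qquad \pi_\theta(z) = \prod_{t=1}^{T} \pi_\theta(z_t \mid z_{<t})
$$

Under this reading, the first term is a pixel reconstruction term and the second ties the token distribution to the autoregressive model.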

Findings

01 Reduces FID from 14.36 to 7.65 on ImageNet-1K

02 Improves IS from 86.55 to 116.70 on ImageNet-1K

03 Enhances visual and multi-modal generation metrics with minimal data and time

Abstract

Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose VA-$\pi$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$\pi$ formulates the generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-$\pi$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy, uses pixel-space reconstruction…
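The idea of treating the generator as a policy and scoring its samples in pixel space can be illustrated with a toy REINFORCE loop. The sketch below is not the paper's implementation. It assumes a hypothetical bigram policy in place of a real AR model, a fixed scalar codebook in place of a tokenizer decoder, and a negative mean squared error as the reward. All function names and hyperparameters are invented for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_tokens(logits, length, rng):
    """Sample a token sequence from a toy bigram policy (assumption, not the paper's model).
    logits has shape (K + 1, K); row K is the start-of-sequence context."""
    K = logits.shape[1]
    probs = softmax(logits)
    prev, tokens = K, []
    for _ in range(length):
        tok = int(rng.choice(K, p=probs[prev]))
        tokens.append(tok)
        prev = tok
    return tokens

def pixel_reward(tokens, codebook, target):
    """Assumed reward: negative MSE between decoded 'pixels' and the target."""
    decoded = codebook[np.asarray(tokens)]
    return -float(np.mean((decoded - target) ** 2))

def log_prob_grad(logits, tokens):
    """Gradient of log pi(tokens) with respect to the bigram logits."""
    K = logits.shape[1]
    probs = softmax(logits)
    grad = np.zeros_like(logits)
    prev = K
    for tok in tokens:
        grad[prev] -= probs[prev]   # d log softmax / d logits = onehot - probs
        grad[prev, tok] += 1.0
        prev = tok
    return grad

def reinforce_step(logits, codebook, target, rng, n_samples=16, lr=0.5):
    """One REINFORCE update with a mean-reward baseline."""
    samples = [sample_tokens(logits, len(target), rng) for _ in range(n_samples)]
    rewards = np.array([pixel_reward(s, codebook, target) for s in samples])
    advantages = rewards - rewards.mean()
    grad = sum(a * log_prob_grad(logits, s) for a, s in zip(advantages, samples))
    return logits + lr * grad / n_samples, float(rewards.mean())

# Toy usage: 3-entry codebook, 4-"pixel" target, uniform initial policy.
rng = np.random.default_rng(0)
codebook = np.array([0.0, 0.5, 1.0])
target = np.array([0.0, 1.0, 0.5, 1.0])
logits = np.zeros((4, 3))
for _ in range(50):
    logits, mean_reward = reinforce_step(logits, codebook, target, rng)
```

The reward is computed only after decoding, so no gradient has to pass through the discrete sampling step. This is the usual reason for choosing a policy-gradient estimator in a discrete token space.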

Code & Models

Models

LilShake66/VA-Pi (Hugging Face model)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis