Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow

Bowen Zheng; Yihong Luo; Tianyang Hu

arXiv:2605.06148·cs.CV·May 8, 2026

TL;DR

This paper introduces wAR-Tok, a tokenizer training method that aligns the token distribution with an autoregressive prior using a Wasserstein gradient flow, improving generation quality on CIFAR-10 and ImageNet.

Contribution

The paper proposes distribution-level prior matching during tokenizer training, addressing the prior mismatch that arises in two-stage discrete image modeling.

Findings

01 wAR-Tok reduces the autoregressive loss.
02 It improves generation FID on CIFAR-10 and ImageNet.
03 It maintains reconstruction quality while improving prior alignment.
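
The "autoregressive loss" in the findings above can be read as the average negative log-likelihood an AR prior assigns to each token given the tokens before it. The sketch below is only an illustration of that quantity, not the paper's method: it swaps in a count-based bigram model for the AR prior, and the function names, the smoothing constant `alpha`, and the two toy token sets are all hypothetical. It shows how two token sets over the same vocabulary can differ sharply in how predictable they are from left to right.

```python
import math
from collections import Counter


def fit_bigram_prior(sequences, vocab_size, alpha=1.0):
    """Fit a count-based bigram prior p(z_t | z_{t-1}) with add-alpha smoothing.

    This is a stand-in for a learned AR prior, chosen only for brevity.
    """
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1

    def prob(prev, cur):
        numerator = pair_counts[(prev, cur)] + alpha
        denominator = context_counts[prev] + alpha * vocab_size
        return numerator / denominator

    return prob


def ar_loss(sequences, prob):
    """Mean negative log-likelihood per predicted token, in nats."""
    total, count = 0.0, 0
    for seq in sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            total -= math.log(prob(prev, cur))
            count += 1
    return total / count


# Hypothetical frozen token sequences from two tokenizers (4-symbol vocabulary).
predictable = [[0, 1, 2, 3, 0, 1, 2, 3]] * 4
scattered = [[0, 2, 1, 1, 3, 0, 3, 2], [1, 0, 0, 3, 2, 2, 1, 3]] * 2

loss_predictable = ar_loss(predictable, fit_bigram_prior(predictable, vocab_size=4))
loss_scattered = ar_loss(scattered, fit_bigram_prior(scattered, vocab_size=4))

print(f"predictable tokens: {loss_predictable:.3f} nats/token")
print(f"scattered tokens:   {loss_scattered:.3f} nats/token")
```

In this toy setting the prior is fitted after the tokens are fixed, which mirrors the two-stage setup the abstract describes: the prior can only do as well as the token distribution allows.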

Abstract

Discrete image tokenizers are commonly trained in two stages: first for reconstruction, and then with a prior model fitted to the frozen token sequences. This decoupling leaves the tokenizer unaware of the model that will later generate its tokens. As a result, the learned tokens may preserve image information well but still be difficult for an autoregressive (AR) prior to predict from left to right. We analyze this mismatch using Tripartite Variational Consistency (TVC), which decomposes latent-variable learning into three consistency conditions: conditional-likelihood consistency, prior consistency, and posterior consistency. TVC shows that two-stage training preserves the reconstruction side but leaves prior consistency outside the tokenizer objective: the overall token distribution is fixed before the AR prior participates in training. Motivated by this view, we add a…
