RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

Siyong Jian; Siyuan Li; Luyuan Zhang; Zedong Wang; Xin Jin; Ying Li; Cheng Tan; Huan Wang

arXiv:2605.21195 · cs.CV · May 21, 2026

TL;DR

RankE is an end-to-end post-training framework for discrete text-to-image models that co-evolves the AR policy and the VQ decoder, improving both image fidelity and text-image alignment.

Contribution

RankE is presented as the first framework to jointly optimize the policy and the decoder in discrete T2I models, addressing the fidelity-alignment trade-off of frozen-decoder methods.

Findings

01. RankE improves CLIP score (higher is better) and FID (lower is better) at the same time.

02. Standard RL post-training raises CLIP score but worsens FID; RankE does not show this trade-off.

03. Consistent gains across multiple models indicate that decoder co-evolution is effective.

Abstract

Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each…
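The idea of Latent Covariate Shift described above can be made concrete with a small diagnostic. The sketch below is not the paper's measurement; it assumes a simple proxy, the KL divergence between marginal token histograms, and uses hypothetical toy data and helper names (`token_histogram`, `kl_divergence`) purely for illustration. It shows that a policy whose token usage drifts away from the distribution the decoder was trained on yields a positive divergence, while an unshifted policy yields zero.

```python
import math
from collections import Counter


def token_histogram(tokens, vocab_size, eps=1e-6):
    """Smoothed empirical distribution over codebook indices 0..vocab_size-1."""
    counts = Counter(tokens)
    total = len(tokens) + eps * vocab_size
    return [(counts.get(i, 0) + eps) / total for i in range(vocab_size)]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical toy data (assumption, not from the paper):
# a 4-entry codebook, uniform token usage in the decoder's training data,
# and a post-trained policy that over-uses token 0.
vocab_size = 4
reference_tokens = [0, 1, 2, 3] * 25
policy_tokens = [0] * 70 + [1] * 20 + [2] * 10

p_ref = token_histogram(reference_tokens, vocab_size)
p_policy = token_histogram(policy_tokens, vocab_size)

shift = kl_divergence(p_policy, p_ref)   # positive when the policy has drifted
no_shift = kl_divergence(p_ref, p_ref)   # zero when the distributions match

print(f"KL(policy || reference) = {shift:.4f}")
print(f"KL(reference || reference) = {no_shift:.4f}")
```

A marginal histogram ignores the spatial and sequential structure of token grids, so this is only a coarse indicator; a real diagnostic would likely compare richer statistics of the token sequences.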

Code & Models

Repositories

syjmelody/RankE (GitHub)
