TL;DR
RankE is an end-to-end post-training framework for discrete text-to-image models that co-evolves the policy and the decoder, improving both image fidelity and text-image alignment.
Contribution
RankE is the first framework to jointly optimize the policy and the decoder in discrete T2I models, overcoming the fidelity-alignment trade-off of frozen-decoder methods.
Findings
RankE improves CLIP score and FID at the same time (higher CLIP score, lower FID).
Standard policy-only RL raises CLIP score but degrades FID; RankE avoids this trade-off.
Consistent gains across multiple models indicate that decoder co-evolution is effective.
Abstract
Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each…
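The Latent Covariate Shift described in the abstract can be illustrated with a toy measurement. The sketch below is not the paper's metric or code: it assumes, purely for illustration, that the shift can be approximated by the KL divergence between the marginal frequency of codebook indices emitted by the policy and the frequency of indices the decoder saw during training. The token lists, the vocabulary size, and the function names are all hypothetical.

```python
import math
from collections import Counter


def token_distribution(tokens, vocab_size, eps=1e-8):
    """Smoothed empirical distribution over codebook indices 0..vocab_size-1."""
    counts = Counter(tokens)
    total = len(tokens) + eps * vocab_size
    return [(counts.get(i, 0) + eps) / total for i in range(vocab_size)]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical toy data with a 4-entry codebook (not taken from the paper).
vocab_size = 4
encoder_tokens = [0, 1, 2, 3, 0, 1, 2, 3]  # tokens the frozen decoder was trained on
policy_before = [0, 1, 2, 3, 1, 0, 3, 2]   # policy samples before post-training
policy_after = [0, 0, 0, 0, 0, 0, 1, 0]    # policy samples after reward-only tuning

reference = token_distribution(encoder_tokens, vocab_size)
shift_before = kl_divergence(token_distribution(policy_before, vocab_size), reference)
shift_after = kl_divergence(token_distribution(policy_after, vocab_size), reference)

print(f"shift before post-training: {shift_before:.4f}")
print(f"shift after post-training:  {shift_after:.4f}")
```

In this toy setting the divergence grows once the policy concentrates on a few tokens, which is the kind of mismatch a frozen decoder would then have to decode. Comparing marginal frequencies is a deliberately crude proxy: it ignores token order and spatial structure, so it should be read as intuition only.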
