TL;DR
Prologue improves autoregressive image generation by prepending a small set of learned prologue tokens to the visual token sequence, raising generation quality without compromising reconstruction fidelity.
Contribution
The paper proposes a decoupled design: prologue tokens are trained exclusively with the AR cross-entropy loss, while visual tokens remain dedicated to reconstruction. This improves generation quality while leaving reconstruction performance almost unchanged.
Findings
Prologue-Base reduces gFID from 21.01 to 10.75 on ImageNet 256x256 without classifier-free guidance, while keeping reconstruction almost unchanged.
Prologue-Large achieves rFID of 0.99 and gFID of 1.46 without auxiliary supervision.
Prologue tokens exhibit emergent semantic structure, with high linear probing accuracy.
Abstract
In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic…
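The sequence layout described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy vocabulary sizes, the choice to shift prologue ids past the visual vocabulary, and the random logits standing in for an AR model are all assumptions made for illustration. It shows only how prologue tokens are prepended to the visual tokens and how a next-token cross-entropy is computed over the combined sequence.

```python
import numpy as np

def build_ar_sequence(prologue_ids, visual_ids, visual_vocab_size):
    """Prepend prologue tokens to the visual tokens as one AR sequence.

    Assumption of this sketch: prologue ids are shifted past the visual
    vocabulary so both token types share a single output vocabulary.
    """
    shifted = np.asarray(prologue_ids) + visual_vocab_size
    return np.concatenate([shifted, np.asarray(visual_ids)])

def ar_cross_entropy(logits, targets):
    """Mean next-token cross-entropy; logits is (T, vocab), targets is (T,)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy sizes (hypothetical): 16 visual codes, 4 prologue codes.
visual_vocab_size, prologue_vocab_size = 16, 4
total_vocab = visual_vocab_size + prologue_vocab_size

prologue_ids = [2, 0, 3]      # small set of prologue tokens
visual_ids = [5, 9, 1, 14]    # visual tokens from the tokenizer

seq = build_ar_sequence(prologue_ids, visual_ids, visual_vocab_size)

# Stand-in for an AR model: row t holds logits for predicting seq[t + 1].
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(seq) - 1, total_vocab))

loss = ar_cross_entropy(logits, seq[1:])
print("sequence:", seq.tolist())
print("AR cross-entropy:", float(loss))
```

In the decoupled design the abstract describes, this cross-entropy term is the only training signal for the prologue tokens, while the visual tokens are trained for reconstruction. The sketch does not model that gradient routing, only the sequence layout and the loss.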
