High-Resolution Image Synthesis via Next-Token Prediction
Dengsheng Chen, Jie Hu, Tiezhu Yue, Xiaoming Wei, Enhua Wu

TL;DR
This paper presents D-JEPA·T2I, an autoregressive model over continuous tokens that generates photorealistic images at arbitrary resolutions, up to 4K. It combines the D-JEPA architecture with a multimodal visual transformer, a flow matching loss, and Visual Rotary Positional Embedding (VoPE) for continuous resolution learning, and reports state-of-the-art results.
Contribution
It introduces an autoregressive approach that combines continuous tokens, a multimodal transformer, a flow matching loss, and dynamic training feedback for high-resolution image synthesis.
Findings
Achieved state-of-the-art high-resolution image synthesis.
Generated photorealistic images up to 4K resolution.
Demonstrated effective integration of textual and visual features.
Abstract
Recently, autoregressive models have demonstrated remarkable performance in class-conditional image generation. However, the application of next-token prediction to high-resolution text-to-image generation remains largely unexplored. In this paper, we introduce D-JEPA·T2I, an autoregressive model based on continuous tokens that incorporates innovations in both architecture and training strategy to generate high-quality, photorealistic images at arbitrary resolutions, up to 4K. Architecturally, we adopt the denoising joint embedding predictive architecture (D-JEPA) while leveraging a multimodal visual transformer to effectively integrate textual and visual features. Additionally, we introduce flow matching loss alongside the proposed Visual Rotary Positional Embedding (VoPE) to enable continuous resolution learning. In terms of training strategy, we propose a data feedback…
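The abstract names a flow matching loss on continuous tokens but does not spell out its form here. As a rough illustration only, the sketch below uses the common linear-interpolation variant: a noise sample x0 and a data token x1 are blended as x_t = (1 - t) x0 + t x1, and a model is trained to predict the velocity x1 - x0. The function name `flow_matching_loss`, the `predict_velocity` callable, and the choice of path are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def flow_matching_loss(predict_velocity, x1, rng):
    """Monte Carlo estimate of a linear-path flow matching loss.

    Assumptions (not taken from the paper): a straight-line path between
    Gaussian noise x0 and data x1, and a hypothetical callable
    predict_velocity(xt, t) standing in for the model.
    """
    x0 = rng.standard_normal(x1.shape)        # noise sample, same shape as the tokens
    t = rng.uniform(size=(x1.shape[0], 1))    # one time in [0, 1) per token
    xt = (1.0 - t) * x0 + t * x1              # point on the straight path at time t
    target = x1 - x0                          # velocity of that path
    pred = predict_velocity(xt, t)
    return float(np.mean((pred - target) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((16, 8))     # 16 toy "tokens" of dimension 8
    zero_model = lambda xt, t: np.zeros_like(xt)
    print(flow_matching_loss(zero_model, tokens, rng))
```

In a real model the callable would be a network conditioned on the autoregressive context; the toy `zero_model` here only shows the calling convention.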
Taxonomy
Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis
