High-Resolution Image Synthesis via Next-Token Prediction

Dengsheng Chen; Jie Hu; Tiezhu Yue; Xiaoming Wei; Enhua Wu

arXiv:2411.14808 · cs.CV · March 4, 2025

TL;DR

This paper presents D-JEPA·T2I, an autoregressive model based on continuous tokens that generates photorealistic images at arbitrary resolutions, up to 4K. It pairs the D-JEPA architecture with a multimodal visual transformer, a flow matching loss, and Visual Rotary Positional Embedding (VoPE) for continuous resolution learning, and the authors report state-of-the-art results.

Contribution

It introduces an autoregressive approach to high-resolution image synthesis that combines continuous tokens, a multimodal transformer, a flow matching loss, and dynamic training feedback.
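
As a rough illustration of what a flow matching loss computes, the sketch below uses the common linear-interpolation form: a noise sample is blended with a data sample at a random time t, and the model is trained to predict the constant velocity between them. This page does not give the paper's exact objective, so the interpolation path, the Gaussian noise prior, and the names `flow_matching_pair`, `flow_matching_loss`, and `predict_velocity` are all assumptions made for illustration only.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Return the interpolated point x_t and the target velocity.

    Assumes a straight-line path x_t = (1 - t) * x0 + t * x1, whose
    velocity is x1 - x0. x0 is a noise sample, x1 a data sample
    (both lists of floats), and t lies in [0, 1].
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def flow_matching_loss(predict_velocity, x1, rng):
    """Single-sample Monte Carlo estimate of the loss for one data vector.

    predict_velocity is a hypothetical stand-in for the network: it takes
    (x_t, t) and returns a list of floats the same length as x_t.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # assumed Gaussian noise prior
    t = rng.random()                         # uniform time in [0, 1)
    x_t, target = flow_matching_pair(x0, x1, t)
    pred = predict_velocity(x_t, t)
    # Mean squared error between predicted and target velocity.
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)
```

In a real model, `x1` would be a continuous image token and `predict_velocity` a learned network conditioned on context; here both are plain Python values so the arithmetic is easy to follow.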

Findings

01. Achieved state-of-the-art high-resolution image synthesis.

02. Generated photorealistic images up to 4K resolution.

03. Demonstrated effective integration of textual and visual features.

Abstract

Recently, autoregressive models have demonstrated remarkable performance in class-conditional image generation. However, the application of next-token prediction to high-resolution text-to-image generation remains largely unexplored. In this paper, we introduce D-JEPA·T2I, an autoregressive model based on continuous tokens that incorporates innovations in both architecture and training strategy to generate high-quality, photorealistic images at arbitrary resolutions, up to 4K. Architecturally, we adopt the denoising joint embedding predictive architecture (D-JEPA) while leveraging a multimodal visual transformer to effectively integrate textual and visual features. Additionally, we introduce flow matching loss alongside the proposed Visual Rotary Positional Embedding (VoPE) to enable continuous resolution learning. In terms of training strategy, we propose a data feedback…

Taxonomy

Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis