X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

Zeyi Sun; Ziyang Chu; Pan Zhang; Tong Wu; Xiaoyi Dong; Yuhang Zang; Yuanjun Xiong; Dahua Lin; Jiaqi Wang

arXiv:2412.01824 · cs.CV · August 28, 2025

TL;DR

X-Prompt is a novel auto-regressive vision-language model that leverages in-context learning to perform a wide range of image generation tasks, including unseen ones, with improved generalization and efficiency.

Contribution

The paper introduces X-Prompt, a unified auto-regressive model that effectively uses in-context learning for diverse and unseen image generation tasks, advancing the capabilities of vision-language models.

Findings

01. X-Prompt achieves competitive results on various seen image generation tasks.

02. The model demonstrates strong generalization to unseen image generation tasks.

03. Efficient in-context feature compression supports longer context sequences.

Abstract

In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in auto-regressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large vision-language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and…
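The abstract's link between compressing in-context examples and supporting longer in-context sequences can be illustrated with a simple token-budget calculation. The sketch below is not the paper's method: it does not model how X-Prompt compresses features, and every number in it (tokens per image, compressed size, context budget) is a hypothetical value chosen only for illustration. It shows why summarizing each example into fewer tokens lets more examples fit in a fixed context window.

```python
from typing import Optional


def tokens_per_example(tokens_per_image: int,
                       images_per_example: int,
                       compressed_tokens: Optional[int] = None) -> int:
    """Tokens one in-context example occupies in the prompt.

    Uncompressed, an example costs all of its image tokens.
    With compression (hypothetical), it costs a fixed, smaller number.
    """
    if compressed_tokens is not None:
        return compressed_tokens
    return tokens_per_image * images_per_example


def max_examples(context_budget: int,
                 tokens_per_image: int,
                 images_per_example: int,
                 compressed_tokens: Optional[int] = None) -> int:
    """How many in-context examples fit in a given token budget."""
    cost = tokens_per_example(tokens_per_image, images_per_example,
                              compressed_tokens)
    return context_budget // cost


# Illustrative numbers only (assumptions, not taken from the paper):
# each example is an (input, output) image pair at 1024 tokens per image,
# and 8192 tokens of the context are reserved for in-context examples.
uncompressed = max_examples(8192, tokens_per_image=1024, images_per_example=2)
compressed = max_examples(8192, tokens_per_image=1024, images_per_example=2,
                          compressed_tokens=32)
print(uncompressed, compressed)
```

Under these assumed numbers the uncompressed prompt holds 4 examples and the compressed one holds 256; the real trade-off depends on the tokenizer, the image resolution, and how much information survives compression.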

Code & Models

Repositories

sunzey/x-prompt (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques