Autoregressive Image Generation with Vision Full-view Prompt

Miaomiao Cai; Guanjie Wang; Wei Li; Zhijun Tu; Hanting Chen; Shaohui Lin; Jie Hu

arXiv:2502.16965 · cs.CV · March 13, 2025

Open Access

TL;DR

This paper introduces a novel Vision Full-view prompt (VF prompt) for autoregressive image generation, inspired by NLP prompt engineering, which improves image structure reconstruction, stability, and overall performance by simulating human visual perception processes.

Contribution

The paper proposes the VF prompt technique for AR image generation, enhancing contextual understanding and stability, and achieving approximately 20% performance improvement over previous methods.

Findings

01. Approximately 20% performance improvement with VF prompts

02. Enhanced image structure reconstruction and stability

03. Better alignment with human visual perception processes

Abstract

In autoregressive (AR) image generation, models based on the 'next-token prediction' paradigm of LLMs have shown comparable performance to diffusion models by reducing inductive biases. However, directly applying LLMs to complex image generation can struggle with reconstructing the image's structure and details, impacting the generation's accuracy and stability. Additionally, the 'next-token prediction' paradigm in the AR model does not align with the contextual scanning and logical reasoning processes involved in human visual perception, limiting effective image generation. Prompt engineering, as a key technique for guiding LLMs, leverages specifically designed prompts to improve model performance on complex natural language processing (NLP) tasks, enhancing accuracy and stability of generation while maintaining contextual coherence and logical consistency, similar to human reasoning.…

Taxonomy

Topics: Advanced Vision and Imaging

Methods: Diffusion · ALIGN