Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation
Bolin Lai, Felix Juefei-Xu, Miao Liu, Xiaoliang Dai, Nikhil Mehta,, Chenguang Zhu, Zeyi Huang, James M. Rehg, Sangmin Lee, Ning Zhang, Tong Xiao

TL;DR
This paper introduces InstaManip, a multi-modal autoregressive model that enables rapid, few-shot image manipulation guided by text and visual examples, overcoming diffusion models' reasoning limitations.
Contribution
The paper proposes a novel group self-attention mechanism and relation regularization to improve few-shot image manipulation with in-context learning.
Findings
Outperforms previous models by at least 19% in human evaluation
Model's performance improves with more diverse exemplars
Effective disentanglement of image features achieved
Abstract
Text-guided image manipulation has experienced notable advancement in recent years. In order to mitigate linguistic ambiguity, few-shot learning with visual examples has been applied for instructions that are underrepresented in the training set, or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models are struggling with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed , that can ntly learn a new image ulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages -- learning and applying, which simplifies the complex problem…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsImage Processing Techniques and Applications · Advanced Image Processing Techniques · Advanced Vision and Imaging
MethodsDiffusion
