PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Huan Teng, Junlin Xie, Yu Qiao, Peng Gao, Hongsheng Li

TL;DR
PixWizard is a versatile image-to-image assistant that unifies multiple vision tasks into a single framework, enabling flexible, instruction-driven image generation and manipulation across diverse resolutions and tasks.
Contribution
The paper introduces PixWizard, a unified model for diverse vision tasks guided by natural language instructions, with a large instruction-tuning dataset and a flexible resolution mechanism.
Findings
Demonstrates strong generalization to unseen tasks and instructions.
Achieves high-quality image generation and editing across various resolutions.
Shows effective fusion of image structure and semantics.
Abstract
This paper presents a versatile image-to-image visual assistant, PixWizard, designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we integrate a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we comprehensively include a large set of diverse vision tasks such as text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any-resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning…
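The excerpt above does not spell out how the any-resolution mechanism chooses a working size. As a rough illustration only, one common way to process images "based on the aspect ratio of the input" is aspect-ratio bucketing: enumerate resolutions under a fixed pixel budget and pick the one whose aspect ratio is closest to the input's. The pixel budget, the multiple of 64, and the function names below are assumptions made for this sketch, not details taken from the paper.

```python
import math


def candidate_resolutions(max_pixels=512 * 512, multiple=64):
    """Enumerate (width, height) pairs whose sides are multiples of
    `multiple` and whose area does not exceed `max_pixels`.

    Both parameters are hypothetical defaults chosen for illustration.
    """
    sizes = []
    limit = max_pixels // multiple  # widest side that still leaves room for one row
    w = multiple
    while w <= limit:
        # Largest height (rounded down to the multiple) that fits the budget.
        h = (max_pixels // w) // multiple * multiple
        if h >= multiple:
            sizes.append((w, h))
        w += multiple
    return sizes


def pick_resolution(width, height, candidates):
    """Return the candidate whose aspect ratio is closest to the input's.

    Ratios are compared in log space so that wide and tall deviations
    are treated symmetrically.
    """
    target = math.log(width / height)
    return min(candidates, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


if __name__ == "__main__":
    buckets = candidate_resolutions()
    print(pick_resolution(1024, 1024, buckets))
    print(pick_resolution(1920, 1080, buckets))
```

In this sketch a square input maps to the square bucket and a widescreen input maps to a wider, shorter bucket with roughly the same pixel count, so the input is resized with little distortion rather than being cropped or squashed to a fixed square.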
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Sparse Evolutionary Training · Diffusion
