PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Weifeng Lin; Xinyu Wei; Renrui Zhang; Le Zhuo; Shitian Zhao; Siyuan Huang; Huan Teng; Junlin Xie; Yu Qiao; Peng Gao; Hongsheng Li

arXiv:2409.15278·cs.CV·February 28, 2025

TL;DR

PixWizard is a versatile image-to-image assistant that unifies multiple vision tasks into a single framework, enabling flexible, instruction-driven image generation and manipulation across diverse resolutions and tasks.

Contribution

The paper introduces PixWizard, a unified model for diverse vision tasks guided by natural language instructions, with a large instruction-tuning dataset and a flexible resolution mechanism.

Findings

01. Demonstrates strong generalization to unseen tasks and instructions.

02. Achieves high-quality image generation and editing across various resolutions.

03. Shows effective fusion of image structure and semantics.

Abstract

This paper presents a versatile image-to-image visual assistant, PixWizard, designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we comprehensively include a large set of diverse vision tasks such as text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any-resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning…
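
The abstract says the any-resolution mechanism processes images according to the aspect ratio of the input. One common way to do this is to snap each input to the nearest of a fixed set of working resolutions. The sketch below illustrates that general idea only; it is not taken from the paper or its repository, and the candidate list and the function name `closest_resolution` are hypothetical.

```python
import math

# Hypothetical candidate working resolutions as (width, height).
# These values are illustrative assumptions, not the ones used by PixWizard.
CANDIDATE_RESOLUTIONS = [
    (512, 512),
    (640, 384),
    (384, 640),
    (768, 256),
    (256, 768),
]


def closest_resolution(width, height, candidates=CANDIDATE_RESOLUTIONS):
    """Return the candidate (width, height) whose aspect ratio is nearest
    to the input's, comparing ratios in log space so that wide and tall
    images are treated symmetrically."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)
    return min(candidates, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))
```

For example, a 1920x1080 input would map to the (640, 384) candidate here, after which the image could be resized to that shape before being passed to the model.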

Code & Models

Repositories

afeng-x/pixwizard
PyTorch · Official

Models

🤗
Afeng-x/PixWizard
model · ♡ 9

Datasets

Afeng-x/PixWizard-Data-500k
dataset · 120 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Sparse Evolutionary Training · Diffusion