UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation
Lunhao Duan, Shanshan Zhao, Wenjun Yan, Yinglun Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Mingming Gong, Gui-Song Xia

TL;DR
This paper introduces UNIC-Adapter, a unified multi-modal transformer framework that enables flexible, controllable image generation from diverse inputs without needing multiple specialized models.
Contribution
The paper presents a novel unified adapter built on a multi-modal diffusion transformer, allowing controllable image synthesis across various conditions within a single model.
Findings
Effective control over pixel-level layouts and styles
Versatile performance across multiple image generation tasks
Outperforms specialized models in controllability
Abstract
Recently, text-to-image generation models have achieved remarkable advancements, particularly with diffusion models facilitating high-quality image synthesis from textual descriptions. However, these models often struggle with achieving precise control over pixel-level layouts, object appearances, and global styles when using text prompts alone. To mitigate this issue, previous works introduce conditional images as auxiliary inputs for image generation, enhancing control but typically necessitating specialized models tailored to different types of reference inputs. In this paper, we explore a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture, to enable flexible and controllable generation across diverse conditions without the…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Medical Image Segmentation Techniques
Methods: Byte Pair Encoding · Linear Layer · Absolute Position Encodings · Dropout · Softmax · Attention Is All You Need · Dense Connections · Residual Connection · Diffusion · Multi-Head Attention
