Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling

Hongyang Wei; Hongbo Liu; Zidong Wang; Yi Peng; Baixin Xu; Size Wu; Xuying Zhang; Xianglong He; Zexiang Liu; Peiyu Wang; Xuchen Song; Yangguang Li; Yang Liu; Yahui Zhou

arXiv:2601.15664 · cs.CV · January 23, 2026

Open Access · 3 Models · 5 Datasets

TL;DR

Skywork UniPic 3.0 introduces a unified framework that uses sequence modeling to improve quality, efficiency, and flexibility in both multi-image composition and single-image editing, with a particular focus on human-object interaction (HOI).

Contribution

The paper presents a sequence-modeling approach to multi-image composition, a comprehensive data pipeline, and a fast inference method, advancing the state of the art in multi-image composition and single-image editing.

Findings

01. Achieves high-quality multi-image composition with only 700K training samples.

02. Produces high-fidelity samples in 8 steps, 12.5x faster than standard methods.

03. Outperforms existing models on multi-image composition benchmarks.
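
The 8-step and 12.5x figures in the findings can be related with simple arithmetic. The sketch below assumes the speed-up is measured purely as a ratio of network evaluations, which this summary does not state; the function name `implied_baseline_evals` is illustrative, and the result is an implied value, not a reported one.

```python
def implied_baseline_evals(fast_steps: int, speedup: float) -> float:
    """Baseline evaluation count implied by a step count and a speed-up ratio.

    Assumption (not stated in the source): cost scales linearly with the
    number of network evaluations, so speedup = baseline / fast_steps.
    """
    if fast_steps <= 0 or speedup <= 0:
        raise ValueError("fast_steps and speedup must be positive")
    return fast_steps * speedup


baseline = implied_baseline_evals(fast_steps=8, speedup=12.5)
print(baseline)  # 100.0
```

Under that assumption the baseline would cost about 100 evaluations per sample, which would match, for example, 50 sampling steps with two evaluations each under classifier-free guidance. The summary does not confirm which baseline configuration was actually used.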

Abstract

The recent surge in popularity of Nano-Banana and Seedream 4.0 underscores the community's strong interest in multi-image composition tasks. Compared to single-image editing, multi-image composition presents significantly greater challenges in terms of consistency and quality, yet existing models have not disclosed specific methodological details for achieving high-quality fusion. Through statistical analysis, we identify Human-Object Interaction (HOI) as the most sought-after category by the community. We therefore systematically analyze and implement a state-of-the-art solution for multi-image composition with a primary focus on HOI-centric tasks. We present Skywork UniPic 3.0, a unified multimodal framework that integrates single-image editing and multi-image composition. Our model supports an arbitrary number (1 to 6) and resolution of input images, as well as arbitrary output…
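
One way to picture how a single sequence model could accept between one and six inputs of differing resolution is to split each image into patches and concatenate the resulting tokens, tagging each token with the image it came from. The following is a minimal sketch of that general idea, not the authors' implementation: the 16-pixel patch size, the `pack_images` name, and the (image, row, column) position tags are all hypothetical.

```python
from typing import List, Tuple

PATCH = 16  # hypothetical patch size in pixels; not taken from the paper
MAX_IMAGES = 6  # upper bound on input images stated in the abstract

Position = Tuple[int, int, int]  # (image index, patch row, patch column)


def pack_images(sizes: List[Tuple[int, int]], patch: int = PATCH) -> List[Position]:
    """Flatten images of arbitrary (height, width) into one token sequence.

    Each image contributes ceil(h / patch) * ceil(w / patch) tokens, laid out
    in row-major order and appended after the tokens of the previous image.
    """
    if not 1 <= len(sizes) <= MAX_IMAGES:
        raise ValueError(f"expected 1 to {MAX_IMAGES} images, got {len(sizes)}")
    sequence: List[Position] = []
    for index, (height, width) in enumerate(sizes):
        rows = -(-height // patch)  # ceiling division
        cols = -(-width // patch)
        sequence.extend((index, r, c) for r in range(rows) for c in range(cols))
    return sequence


# Two reference images at different resolutions share one sequence.
tokens = pack_images([(512, 512), (256, 384)])
print(len(tokens))  # 1408
```

Because the sequence length simply grows with the number and size of the inputs, the same packing covers single-image editing (one image) and multi-image composition (several images) without a separate code path.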

Taxonomy

Topics: Advanced Image Fusion Techniques · Visual Attention and Saliency Detection · Cell Image Analysis Techniques