DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Dianyi Wang; Ruihang Li; Feng Han; Chaofan Ma; Wei Song; Siyuan Wang; Yibin Wang; Yi Xin; Hongjian Liu; Zhixiong Zhang; Shengyuan Ding; Tianhang Wang; Zhenglin Cheng; Tao Lin; Cheng Jin; Kaicheng Yu; Jingjing Chen; Wenjie Wang; Zhongyu Wei; Jiaqi Wang

arXiv:2602.12205·cs.CV·February 16, 2026

Open Access · 4 Models · 2 Datasets

TL;DR

DeepGen 1.0 is a lightweight 5B unified multimodal model for image generation and editing that is competitive with or surpasses much larger counterparts (>10B). It relies on Stacked Channel Bridging (SCB) for VLM-to-generator alignment and a three-stage progressive training strategy, with lower training cost and a smaller deployment footprint.

Contribution

Introduces Stacked Channel Bridging and a progressive training strategy to enable a compact model to perform competitively in image generation and editing tasks.

Findings

01. Surpasses larger models on key benchmarks by 28-37%.

02. Achieves high performance with only ~50M training samples.

03. Democratizes multimodal research with open-source resources.

Abstract

Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and…
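
The abstract describes SCB only at a high level, so the following is a minimal, dependency-free sketch of the general idea rather than the authors' implementation. It assumes that "stacking" means concatenating per-token hidden states from several VLM layers along the channel axis, that a single linear projection maps the stacked features to the generator's conditioning width, and that the think tokens are appended along the sequence axis. The function names, layer indices, and dimensions are all hypothetical.

```python
import random


def stack_channels(layer_feats):
    """Concatenate per-token features from several layers along the channel axis.

    layer_feats: list of layers, each shaped [num_tokens][dim].
    Returns a list shaped [num_tokens][num_layers * dim].
    """
    num_tokens = len(layer_feats[0])
    return [
        [x for layer in layer_feats for x in layer[t]]
        for t in range(num_tokens)
    ]


def project(tokens, weight):
    """Linear projection: out[t][j] = sum_i tokens[t][i] * weight[i][j]."""
    columns = list(zip(*weight))
    return [
        [sum(v * w for v, w in zip(tok, col)) for col in columns]
        for tok in tokens
    ]


def build_condition(hidden_states, layer_ids, weight, think_tokens):
    """Build a conditioning sequence from multi-layer features and think tokens.

    ASSUMPTION: fusion is modeled as a sequence-level append of the think
    tokens; the abstract does not specify the actual fusion mechanism.
    """
    selected = [hidden_states[i] for i in layer_ids]
    stacked = stack_channels(selected)
    projected = project(stacked, weight)
    return projected + think_tokens


# Hypothetical toy sizes, chosen only so the shapes are easy to follow.
rng = random.Random(0)
num_layers, num_tokens, dim, cond_dim, num_think = 6, 4, 8, 5, 2

# Stand-in for per-layer VLM hidden states: [num_layers][num_tokens][dim].
hidden_states = [
    [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
    for _ in range(num_layers)
]
layer_ids = [1, 3, 5]  # hypothetical choice of shallow, middle, and deep layers
weight = [
    [rng.gauss(0.0, 0.1) for _ in range(cond_dim)]
    for _ in range(len(layer_ids) * dim)
]
# Think tokens would be learnable parameters; zeros are placeholders here.
think_tokens = [[0.0] * cond_dim for _ in range(num_think)]

condition = build_condition(hidden_states, layer_ids, weight, think_tokens)
print(len(condition), len(condition[0]))  # 6 5
```

The resulting sequence has one conditioning vector per VLM token plus one per think token, each at the assumed generator width. In a real model the projection weights and think tokens would be trained, and the hidden states would come from the VLM rather than a random generator.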

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship