LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

Zeyu Wang; Zilong Chen; Chenhui Gou; Feng Li; Chaorui Deng; Deyao Zhu; Kunchang Li; Weihao Yu; Haoqin Tu; Haoqi Fan; Cihang Xie

arXiv:2510.22946·cs.CV·November 21, 2025

Open Access

TL;DR

LightFusion introduces a double fusion framework that combines publicly available models specialized for either understanding or generation, reaching competitive benchmark scores (for example, 0.91 on GenEval) with only ~35B training tokens.

Contribution

The paper proposes a double fusion mechanism that retains the original blocks of the base models and interleaves additional multimodal self-attention blocks throughout the networks, enabling rich multimodal fusion while largely preserving the base models' strengths.

Findings

01. Achieves 0.91 on GenEval for text-to-image generation

02. Scores 82.16 on DPG-Bench for complex image generation

03. Attains 6.06 on GEditBench for image editing

Abstract

Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multimodal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~35B tokens, this approach…
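
The interleaving idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class `JointSelfAttention`, the function `double_fusion_forward`, the single-head attention over concatenated tokens, the zero-initialized output projection, and the `np.tanh` stand-ins for the base models' blocks are all assumptions made for illustration. It covers only the interleaved attention blocks, not the fusion of encoder features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class JointSelfAttention:
    """Hypothetical single-head self-attention over both token streams."""

    def __init__(self, dim, rng):
        s = dim ** -0.5
        self.wq = rng.normal(0.0, s, (dim, dim))
        self.wk = rng.normal(0.0, s, (dim, dim))
        self.wv = rng.normal(0.0, s, (dim, dim))
        # Assumption: a zero output projection makes the new block an identity
        # map at initialization, so the base models' behavior starts unchanged.
        self.wo = np.zeros((dim, dim))

    def __call__(self, und, gen):
        x = np.concatenate([und, gen], axis=0)          # (Nu + Ng, dim)
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))  # each token attends to both streams
        out = x + (attn @ v) @ self.wo                  # residual connection
        return out[: len(und)], out[len(und):]          # split back into the two streams


def double_fusion_forward(und, gen, und_blocks, gen_blocks, fusion_blocks):
    """Run the retained base blocks, with a joint attention block after each pair."""
    for und_block, gen_block, fuse in zip(und_blocks, gen_blocks, fusion_blocks):
        und = und_block(und)       # retained original understanding block
        gen = gen_block(gen)       # retained original generation block
        und, gen = fuse(und, gen)  # interleaved multimodal self-attention
    return und, gen


rng = np.random.default_rng(0)
dim = 8
und_tokens = rng.normal(size=(3, dim))  # toy understanding-stream tokens
gen_tokens = rng.normal(size=(5, dim))  # toy generation-stream tokens
base_blocks = [np.tanh, np.tanh]        # stand-ins for pretrained blocks
fusion = [JointSelfAttention(dim, rng) for _ in base_blocks]
und_out, gen_out = double_fusion_forward(
    und_tokens, gen_tokens, base_blocks, base_blocks, fusion
)
print(und_out.shape, gen_out.shape)
```

With the zero-initialized projection, the output matches what the stand-in base blocks alone would produce. Once `wo` is trained away from zero, each stream's tokens begin to depend on the other stream. That is one plausible way to read "preserving the original strengths of the base models" while still enabling fusion.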

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Artificial Intelligence Applications