TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning

Junzhe Xu; Yuyang Yin; Xi Chen

arXiv:2508.08098·cs.CV·August 15, 2025


TL;DR

TBAC-UniImage is a unified model for multimodal understanding and generation. It couples a pre-trained diffusion model with a multimodal large language model (MLLM) and conditions the diffusion model on representations from multiple MLLM layers rather than on the final hidden state alone.

Contribution

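The paper introduces a ladder-style approach in which a pre-trained diffusion model acts as a generative ladder and takes representations from multiple MLLM layers as its generative conditions. This gives a deeper connection than final-hidden-state conditioning while avoiding the cost of pretraining a unified architecture from scratch.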

Findings

01. Enhanced multimodal understanding and generation.

02. Deeper integration of hierarchical representations.

03. Efficient use of pre-trained models without full retraining.

Abstract

This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models face two primary limitations. One approach uses only the MLLM's final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations within the MLLM's intermediate layers. The other approach, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm. Instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the…
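The contrast the abstract draws, final-hidden-state conditioning versus conditioning on several MLLM layers, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the evenly spaced layer selection, and the one-layer-per-diffusion-block pairing are hypothetical and are not taken from the paper, which this excerpt truncates before describing the actual mechanism.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def pick_condition_layers(num_mllm_layers: int, num_diffusion_blocks: int) -> List[int]:
    """Pick one MLLM layer index per diffusion block.

    Hypothetical scheme: indices are evenly spaced from the first layer to the
    final layer. The paper's real layer selection is not stated in the excerpt.
    """
    if num_mllm_layers < 1 or num_diffusion_blocks < 1:
        raise ValueError("both counts must be positive")
    last = num_mllm_layers - 1
    if num_diffusion_blocks == 1:
        return [last]
    step_count = num_diffusion_blocks - 1
    return [(i * last) // step_count for i in range(num_diffusion_blocks)]


def build_conditions(
    hidden_states: Sequence[Vector],
    num_diffusion_blocks: int,
    multi_layer: bool = True,
) -> List[Tuple[int, Vector]]:
    """Return (layer_index, representation) pairs, one per diffusion block.

    multi_layer=False mimics the "shallow" baseline from the abstract, where
    every block sees only the final hidden state.
    """
    if multi_layer:
        layers = pick_condition_layers(len(hidden_states), num_diffusion_blocks)
    else:
        layers = [len(hidden_states) - 1] * num_diffusion_blocks
    return [(layer, hidden_states[layer]) for layer in layers]


# Toy stand-in for per-layer MLLM hidden states: 6 layers, 2-dim vectors.
toy_hidden_states = [[float(layer)] * 2 for layer in range(6)]

ladder = build_conditions(toy_hidden_states, num_diffusion_blocks=3)
shallow = build_conditions(toy_hidden_states, num_diffusion_blocks=3, multi_layer=False)

print([layer for layer, _ in ladder])   # [0, 2, 5]
print([layer for layer, _ in shallow])  # [5, 5, 5]
```

In the multi-layer case each diffusion block receives a representation from a different depth of the MLLM, whereas the baseline repeats the final layer for every block. A real system would also need learned projections between the MLLM hidden size and the diffusion model's conditioning size, which this sketch leaves out.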


Code & Models

Models

Hugging Face: TencentBAC/TBAC-UniImage-3B (model, 8 downloads, 6 likes)
