TL;DR
TBAC-UniImage presents a new unified model that combines a pre-trained diffusion model with a multimodal large language model, using multi-layer representations to enhance understanding and generation capabilities.
Contribution
It treats a pre-trained diffusion model as a "generative ladder" and conditions it on representations from multiple MLLM layers rather than only the final hidden state, avoiding the cost of pretraining a unified architecture from scratch.
Findings
Enhanced multimodal understanding and generation.
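Deeper integration: the generator draws on hierarchical representations from intermediate MLLM layers, not just the final hidden state.
Reuse of a pre-trained diffusion model and MLLM instead of pretraining a unified generative architecture from scratch.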
Abstract
This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models face two primary limitations. One approach uses only the MLLM's final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations within the MLLM's intermediate layers. The other approach, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm. Instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the…
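The contrast the abstract draws, conditioning on the final hidden state only versus conditioning on several layers, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it says how the layer representations reach the diffusion model, so the per-layer linear projections, the chosen layer indices, the tensor sizes, and the function name `multi_layer_conditions` are all assumptions made for illustration, and random arrays stand in for real MLLM hidden states.

```python
import numpy as np


def multi_layer_conditions(hidden_states, layer_ids, projections):
    """Pick hidden states from the given layers and project each one.

    hidden_states: list of arrays, one per MLLM layer, each (num_tokens, mllm_dim)
    layer_ids:     indices of the layers to use as generative conditions
    projections:   one (mllm_dim, cond_dim) matrix per selected layer
                   (hypothetical adapters, not taken from the paper)
    Returns a list of (num_tokens, cond_dim) condition arrays.
    """
    conditions = []
    for layer_id, proj in zip(layer_ids, projections):
        h = hidden_states[layer_id]   # representation from one MLLM layer
        conditions.append(h @ proj)   # map to the generator's condition width
    return conditions


# Toy stand-ins for real model activations (all sizes are assumptions).
rng = np.random.default_rng(0)
num_layers, num_tokens, mllm_dim, cond_dim = 12, 8, 64, 32
hidden_states = [rng.standard_normal((num_tokens, mllm_dim))
                 for _ in range(num_layers)]

# Multi-layer conditioning: a shallow, a middle, and the final layer.
layer_ids = [3, 7, 11]
projections = [rng.standard_normal((mllm_dim, cond_dim)) * 0.02
               for _ in layer_ids]
multi = multi_layer_conditions(hidden_states, layer_ids, projections)

# Baseline described in the abstract: the final hidden state only.
single = multi_layer_conditions(hidden_states, [num_layers - 1], projections[-1:])

print(len(multi), multi[0].shape)
print(len(single), single[0].shape)
```

In this sketch the single-layer baseline is simply the special case where `layer_ids` contains only the last layer, which makes the difference between the two conditioning schemes easy to see. How each condition is actually consumed by the diffusion model is left out, since the source does not specify it.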
