Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang

TL;DR
This paper introduces Dream Engine, a unified framework for arbitrary text-image interleaved control in image generation. It uses large multimodal models to align text and image inputs in a shared representation space and reports results competitive with state-of-the-art models.
Contribution
The paper proposes Dream Engine, a novel framework that enables flexible text-image interleaved control by integrating multimodal encoders and a two-stage training process.
Findings
Achieves a 0.69 score on the GenEval benchmark.
Matches performance of SD3.5 and FLUX models.
Effective two-stage training paradigm for multimodal alignment.
Abstract
The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, like canny and depth map, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To mitigate the gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, where image and text can be well-aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image…
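The idea in the abstract, that interleaved text and image inputs share one representation space and jointly condition an external diffusion model, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the random arrays stand in for LMM hidden states, and the names `build_interleaved_sequence` and `project_condition`, the dimensions, and the single linear adapter are hypothetical rather than taken from the paper.

```python
import numpy as np


def build_interleaved_sequence(segments):
    """Concatenate text and image embeddings in their original order.

    Each segment is an array of shape (tokens, shared_dim); the order of
    the list is the order in which text and images were interleaved.
    """
    return np.concatenate(segments, axis=0)


def project_condition(hidden, weight, bias):
    """Hypothetical linear adapter: map shared-space states to the
    conditioning width expected by a diffusion backbone."""
    return hidden @ weight + bias


rng = np.random.default_rng(0)
shared_dim, cond_dim = 16, 8  # illustrative sizes, not from the paper

# Stand-ins for LMM outputs: text tokens, image patches, more text tokens.
text_a = rng.normal(size=(3, shared_dim))
image = rng.normal(size=(4, shared_dim))
text_b = rng.normal(size=(2, shared_dim))

sequence = build_interleaved_sequence([text_a, image, text_b])

weight = rng.normal(size=(shared_dim, cond_dim)) * 0.1
bias = np.zeros(cond_dim)
condition = project_condition(sequence, weight, bias)

print(condition.shape)  # (9, 8): one conditioning vector per input token
```

Because text and image tokens pass through the same adapter, the diffusion model would receive a single conditioning sequence regardless of how the two modalities are ordered, which is the property the abstract attributes to the shared representation space.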
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship
Methods: Gated Linear Unit · Absolute Position Encodings · Inverse Square Root Schedule · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Attention Is All You Need
