Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

Liang Chen; Shuai Bai; Wenhao Chai; Weichu Xie; Haozhe Zhao; Leon; Vinci; Junyang Lin; Baobao Chang

arXiv:2502.20172·cs.CV·February 28, 2025

TL;DR

This paper introduces Dream Engine, a unified framework for arbitrary text-image interleaved control in image generation, leveraging multimodal models for improved alignment and flexibility, achieving competitive results with state-of-the-art models.

Contribution

The paper proposes Dream Engine, a novel framework that enables flexible text-image interleaved control by integrating multimodal encoders and a two-stage training process.

Findings

01. Achieves an overall score of 0.69 on the GenEval benchmark.

02. Matches the performance of state-of-the-art text-to-image models such as SD3.5 and FLUX.

03. Shows that a two-stage training paradigm is effective for multimodal alignment.

Abstract

The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, like canny and depth map, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To mitigate the gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, where image and text can be well-aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image…

Code & Models

Repositories

chenllliang/dreamengine
jax · Official

Models

leonardPKU/DreamEngine-ObjectFusion
model · 31 downloads · 6 likes

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship

Methods: Gated Linear Unit · Absolute Position Encodings · Inverse Square Root Schedule · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Attention Is All You Need