Loom: Diffusion-Transformer for Interleaved Generation
Mingcheng Ye, Jiaming Liu, Yiren Song

TL;DR
Loom is a diffusion-transformer framework that enables coherent, controllable, and efficient interleaved text-image generation for complex tasks like style transfer and tutorials, outperforming existing models.
Contribution
Loom extends the Bagel unified model through full-parameter fine-tuning, an interleaved architecture that alternates textual and visual embeddings, and a language planning strategy for improved multi-condition reasoning and sequential synthesis.
Findings
Loom achieves a 2.6-point improvement over Anole on temporal and semantic metrics.
Loom demonstrates superior compositionality, temporal coherence, and text-image alignment.
Loom outperforms baseline models on a new 50K interleaved tutorial dataset.
Abstract
Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation.…
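The per-frame conditioning described in the abstract (a small set of sampled prior frames plus the global textual context, instead of the full history) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_prior_frames` and `build_context`, the budget `k`, and the choice of uniform sampling without replacement are all assumptions made for illustration, since the abstract does not say how prior frames are sampled.

```python
import random
from typing import Dict, List, Sequence


def sample_prior_frames(num_prior: int, k: int, rng: random.Random) -> List[int]:
    """Return at most k indices of prior frames, in temporal order.

    Assumption: uniform sampling without replacement. The source only
    says "a small set of sampled prior frames".
    """
    if num_prior <= k:
        # Fewer prior frames than the budget: use all of them.
        return list(range(num_prior))
    return sorted(rng.sample(range(num_prior), k))


def build_context(
    global_text: str,
    step_prompts: Sequence[str],
    t: int,
    k: int,
    rng: random.Random,
) -> Dict[str, object]:
    """Assemble the (hypothetical) conditioning for frame t.

    The context size is bounded by k regardless of t, which is the
    point of sampling instead of concatenating all history.
    """
    return {
        "global_text": global_text,          # shared across all frames
        "step_prompt": step_prompts[t],      # stepwise prompt from the planner
        "prior_frames": sample_prior_frames(t, k, rng),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    prompts = [f"step {i}" for i in range(8)]  # placeholder stepwise prompts
    for t in range(len(prompts)):
        ctx = build_context("global instruction", prompts, t, k=3, rng=rng)
        print(t, ctx["prior_frames"])
```

With a fixed budget `k`, the number of conditioning frames stays constant as the sequence grows, whereas concatenating all history would grow linearly with the frame index.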
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship
