Loom: Diffusion-Transformer for Interleaved Generation

Mingcheng Ye; Jiaming Liu; Yiren Song

arXiv:2512.18254·cs.CV·December 23, 2025

Open Access · 1 Dataset

TL;DR

Loom is a diffusion-transformer framework that enables coherent, controllable, and efficient interleaved text-image generation for complex tasks like style transfer and tutorials, outperforming existing models.

Contribution

Loom introduces a novel interleaved architecture with full-parameter fine-tuning and a language planning strategy for improved multi-condition reasoning and sequential synthesis.

Findings

01

Loom achieves a 2.6-point improvement over Anole on temporal and semantic metrics.

02

Loom demonstrates superior compositionality, temporal coherence, and text-image alignment.

03

Loom outperforms baseline models on a new 50K interleaved tutorial dataset.

Abstract

Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation.…
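The generation loop described in the abstract (plan stepwise prompts, then condition each frame on a few sampled prior frames plus the global text) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: `plan_steps`, `sample_prior_frames`, `generate_frame`, and the parameter `k` are hypothetical names, the planner and generator are string placeholders, and uniform random sampling of prior frames is an assumption, since the abstract does not say how frames are sampled.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    prompt: str  # stepwise text prompt produced by the planner
    frame: str   # placeholder for a generated frame (a real system would hold latents or pixels)


def plan_steps(instruction: str, num_steps: int) -> list[str]:
    """Hypothetical stand-in for language planning: one prompt per step."""
    return [f"Step {i + 1} of {num_steps}: {instruction}" for i in range(num_steps)]


def sample_prior_frames(history: list[str], k: int, rng: random.Random) -> list[str]:
    """Pick at most k prior frames, keeping temporal order.

    Uniform sampling is an assumption; the source does not specify the scheme.
    """
    if len(history) <= k:
        return list(history)
    indices = sorted(rng.sample(range(len(history)), k))
    return [history[i] for i in indices]


def generate_frame(prompt: str, global_context: str, prior_frames: list[str]) -> str:
    """Placeholder for the diffusion-transformer call; records the context size it saw."""
    return f"frame<{prompt} | ctx={len(prior_frames)}>"


def generate_interleaved(instruction: str, num_steps: int, k: int = 2, seed: int = 0) -> list[Step]:
    rng = random.Random(seed)
    prompts = plan_steps(instruction, num_steps)
    history: list[str] = []
    steps: list[Step] = []
    for prompt in prompts:
        # Condition on a small sampled subset of history, not the full history.
        context = sample_prior_frames(history, k, rng)
        frame = generate_frame(prompt, instruction, context)
        history.append(frame)
        steps.append(Step(prompt, frame))
    return steps
```

Under these assumptions, the visual context per frame is capped at `k` frames regardless of sequence length, which is the property the abstract credits for efficient long-horizon generation.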

Code & Models

Datasets

plantian/Loom_01
dataset · 16k downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship