Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation

Ao Ma; Jiasong Feng; Ke Cao; Jing Wang; Yun Wang; Quanwei Zhang; Zhanjie Zhang

arXiv:2508.08949 · cs.CV · August 13, 2025

TL;DR

Lay2Story introduces a diffusion transformer-based framework for layout-togglable storytelling, enabling fine-grained subject control and consistency across frames, supported by a large-scale dataset and benchmark.

Contribution

The paper presents Lay2Story, a novel diffusion transformer model, and a new dataset Lay2Story-1M for layout-guided storytelling, advancing control and quality in generated stories.

Findings

01. Outperforms previous SOTA methods in consistency and aesthetic quality

02. Enables precise control over subject attributes and positions

03. Provides a large-scale dataset and benchmark for future research

Abstract

Storytelling tasks involving generating consistent subjects have gained significant attention recently. However, existing methods, whether training-free or training-based, continue to face challenges in maintaining subject consistency due to the lack of fine-grained guidance and inter-frame interaction. Additionally, the scarcity of high-quality data in this field makes it difficult to precisely control storytelling tasks, including the subject's position, appearance, clothing, expression, and posture, thereby hindering further advancements. In this paper, we demonstrate that layout conditions, such as the subject's position and detailed attributes, effectively facilitate fine-grained interactions between frames. This not only strengthens the consistency of the generated frame sequence but also allows for precise control over the subject's position, appearance, and other key details.…
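
To make the idea of a togglable layout condition concrete, here is a minimal sketch of how per-frame conditions might be represented. It is an illustration only, not the paper's data format or model interface: the names `SubjectLayout`, `FrameCondition`, and `to_pixels`, the normalized bounding-box encoding, and the use of `None` to mean "layout off" are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SubjectLayout:
    """Hypothetical layout condition for one subject in one frame."""
    # Normalized box (x0, y0, x1, y1) with values in [0, 1]; an assumed encoding.
    box: Tuple[float, float, float, float]
    # Free-text attributes such as appearance, clothing, expression, posture.
    attributes: str

    def __post_init__(self) -> None:
        x0, y0, x1, y1 = self.box
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid normalized box: {self.box}")


@dataclass
class FrameCondition:
    """Hypothetical per-frame input: a text prompt plus an optional layout."""
    prompt: str
    # None stands for "layout toggled off" in this sketch.
    layout: Optional[SubjectLayout] = None

    @property
    def layout_enabled(self) -> bool:
        return self.layout is not None


def to_pixels(layout: SubjectLayout, width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert the normalized box to integer pixel coordinates."""
    x0, y0, x1, y1 = layout.box
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


# A two-frame story: the first frame pins the subject, the second leaves it free.
story = [
    FrameCondition(
        prompt="A girl walks through a forest",
        layout=SubjectLayout(box=(0.25, 0.5, 0.75, 1.0),
                             attributes="red coat, smiling, walking"),
    ),
    FrameCondition(prompt="She finds a hidden cabin"),
]
```

In this sketch, leaving `layout` unset is what makes a frame's layout condition optional; how the actual model consumes or toggles such conditions is not specified in the text above.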
