Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation
Ao Ma, Jiasong Feng, Ke Cao, Jing Wang, Yun Wang, Quanwei Zhang, Zhanjie Zhang

TL;DR
Lay2Story introduces a diffusion transformer-based framework for layout-togglable storytelling, enabling fine-grained subject control and consistency across frames, supported by a large-scale dataset and benchmark.
Contribution
The paper presents Lay2Story, a diffusion transformer model, together with Lay2Story-1M, a new dataset for layout-guided storytelling, advancing both control and quality in generated stories.
Findings
Outperforms previous SOTA methods in consistency and aesthetic quality
Enables precise control over subject attributes and positions
Provides a large-scale dataset and benchmark for future research
Abstract
Storytelling tasks involving generating consistent subjects have gained significant attention recently. However, existing methods, whether training-free or training-based, continue to face challenges in maintaining subject consistency due to the lack of fine-grained guidance and inter-frame interaction. Additionally, the scarcity of high-quality data in this field makes it difficult to precisely control storytelling tasks, including the subject's position, appearance, clothing, expression, and posture, thereby hindering further advancements. In this paper, we demonstrate that layout conditions, such as the subject's position and detailed attributes, effectively facilitate fine-grained interactions between frames. This not only strengthens the consistency of the generated frame sequence but also allows for precise control over the subject's position, appearance, and other key details.…
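To make the idea of a togglable layout condition concrete, the following is a minimal sketch of how per-frame conditions might be represented. It is not the authors' implementation: the names `SubjectLayout`, `FrameSpec`, and `describe_condition`, the normalized-box encoding, and the flattened text format are all hypothetical, chosen only to illustrate a subject position plus detailed attributes that can be switched on or off per frame.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class SubjectLayout:
    # Hypothetical encoding: normalized box (x0, y0, x1, y1) with values in [0, 1].
    box: Tuple[float, float, float, float]
    # Free-form attribute descriptions, e.g. clothing, expression, posture.
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class FrameSpec:
    prompt: str
    # None means the layout condition is toggled off for this frame.
    layout: Optional[SubjectLayout] = None


def describe_condition(frame: FrameSpec) -> str:
    """Flatten one frame's conditions into a single string (illustrative only)."""
    if frame.layout is None:
        return frame.prompt
    x0, y0, x1, y1 = frame.layout.box
    if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
        raise ValueError("box must be normalized with x0 < x1 and y0 < y1")
    parts = [frame.prompt, f"box=({x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f})"]
    if frame.layout.attributes:
        # Sort keys so the output is deterministic.
        attrs = sorted(frame.layout.attributes.items())
        parts.append(", ".join(f"{k}: {v}" for k, v in attrs))
    return " | ".join(parts)
```

In this sketch, leaving `layout` as `None` falls back to the plain text prompt, while supplying a `SubjectLayout` adds the position and attribute guidance for that frame.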
