Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization
Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Zeyu Zheng, Zirui Wang, Cihang Xie, Yuyin Zhou

TL;DR
Story-Iter is a training-free iterative framework that improves long story visualization by progressively refining images using global reference embeddings, achieving state-of-the-art results in semantic consistency and detail.
Contribution
It introduces a novel training-free iterative paradigm with a global reference cross-attention module for enhanced long story visualization.
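To make the idea of attending over global reference embeddings concrete, here is a minimal sketch of plain scaled dot-product cross-attention in which each reference frame is summarized by a single embedding vector. It is an illustration under stated assumptions, not the paper's GRCA implementation: the function names, the toy 4-dimensional embeddings, and the omission of learned query/key/value projections are all hypothetical simplifications.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_reference_attention(
    query: Sequence[float],
    ref_embeddings: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Attend from one query vector to one global embedding per reference frame.

    Assumption: keys and values are the reference embeddings themselves
    (no learned projections), which keeps the sketch self-contained.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, ref)) / math.sqrt(dim)
        for ref in ref_embeddings
    ]
    weights = softmax(scores)
    context = [
        sum(w * ref[i] for w, ref in zip(weights, ref_embeddings))
        for i in range(dim)
    ]
    return context, weights


# Three reference frames, each reduced to a single toy 4-d global embedding.
refs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
query = [2.0, 0.0, 0.0, 0.0]
context, weights = global_reference_attention(query, refs)
```

Because every reference frame contributes one vector, the attention cost in this sketch grows linearly with the number of frames, which is one plausible reason a compact global embedding helps with long sequences.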
Findings
Outperforms existing methods in semantic consistency.
Effective in handling up to 100 frames in long stories.
Demonstrates superior fine-grained interaction quality.
Abstract
This paper introduces Story-Iter, a new training-free iterative paradigm to enhance long-story generation. Unlike existing methods that rely on fixed reference images to construct a complete story, our approach features a novel external iterative paradigm, extending beyond the internal iterative denoising steps of diffusion models, to continuously refine each generated image by incorporating all reference images from the previous round. To achieve this, we propose a plug-and-play, training-free global reference cross-attention (GRCA) module, modeling all reference frames with global embeddings, ensuring semantic consistency in long sequences. By progressively incorporating holistic visual context and text constraints, our iterative paradigm enables precise generation with fine-grained interactions, optimizing the story visualization step-by-step. Extensive experiments in the official…
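The external iteration described in the abstract can be sketched as a loop that regenerates every frame while conditioning on all frames from the previous round. This is a toy sketch, not the authors' code: `story_iter`, `toy_generate`, the use of floats as stand-ins for images, and the text-only first round are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Frame = float  # hypothetical stand-in for a generated image


def story_iter(
    prompts: Sequence[str],
    generate: Callable[[str, Sequence[Frame]], Frame],
    num_iterations: int,
) -> List[Frame]:
    """Refine a whole story over several external rounds.

    Assumption: the initial round uses text only (no references).
    Each later round conditions every frame on ALL frames of the previous round.
    """
    frames = [generate(prompt, []) for prompt in prompts]
    for _ in range(num_iterations):
        references = frames  # complete sequence from the previous round
        # Build the new round in full before replacing the old one, so every
        # frame in this round sees the same set of references.
        frames = [generate(prompt, references) for prompt in prompts]
    return frames


def toy_generate(prompt: str, references: Sequence[Frame]) -> Frame:
    """Toy generator: blends a prompt-derived value with the reference mean."""
    base = float(len(prompt))
    if not references:
        return base
    return 0.5 * base + 0.5 * (sum(references) / len(references))


prompts = ["a", "bbb", "ccccc"]
initial = story_iter(prompts, toy_generate, num_iterations=0)
refined = story_iter(prompts, toy_generate, num_iterations=1)
```

In the toy setup the refined frames sit closer together than the initial ones, mimicking how shared global context pulls frames toward a consistent look. Note that each external round repeats a full generation pass over all prompts, so total cost scales with the number of rounds.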
Peer Reviews
Decision: ICLR 2026 Poster
1. **Novel and Targeted Iterative Paradigm**: The proposed external iterative framework directly addresses the core limitations of existing AR and RI paradigms in long story visualization. By using full-length frames from the previous iteration as references (instead of fixed or limited frames), it effectively mitigates error accumulation and global consistency loss, a long-standing challenge in the field.
2. **Efficient and Lightweight GRCA Module**: The Global Reference Cross-Attention (GRCA
1. **Computational Efficiency for Extremely Long Stories**: While more efficient than baselines, generating a 100-frame 1024×1024 story still incurs 4.30 PFLOPs per iteration. If the paper were to include experiments based on a distilled single-step diffusion model to demonstrate the universality of its method, it would more fully prove the superiority of its method.
2. **Tradeoff Between Consistency and Text Alignment**: Longer iterations (≥10) slightly weaken text-image alignment (CLIP-T drop
1. **Novel and Effective Paradigm:** The core strength of the paper is the proposal of an iterative refinement paradigm for story visualization. This is a genuinely new approach in this domain that effectively addresses the key challenge of long-range consistency by allowing the model to gain a global view of the entire story and refine it over multiple passes. It elegantly sidesteps the error accumulation of AR models and the rigidity of RI models.
2. **State-of-the-Art Performance:** The me
1. **Prohibitive Computational Cost:** The most significant weakness is the method's computational expense. An $L$-iteration process results in an $L$-fold increase in generation time compared to single-pass methods. The default of 10 iterations makes the method an order of magnitude slower than its competitors. This is a major practical limitation that is understated in the main paper and largely relegated to the appendix. While the authors suggest using acceleration techniques, no experiments
1) Novel iterative paradigm: External iterations that refine all frames by referencing the complete previous sequence, effectively addressing error accumulation in autoregressive methods and fixed-reference limitations
2) Scalable GRCA module: Uses compact global embeddings rather than high-dimensional latent features, enabling 100+ frame stories with manageable memory (19GB vs 40GB for StoryDiffusion)
3) Strong empirical validation: Consistent improvements across multiple metrics and benchmarks
1) Base model (IP-Adapter) outdated. Even though we are in the era of Flux/SD3.5, use of IP-Adapter-based models with SD 1.5 (I am not sure whether it is 1.5 - I checked the code and you only provided StoryIter-XL version of the code) or SD-XL based implementation feels a bit outdated. I guess you used it to implement based on IP-Adapter, but when we consider Flux.1.Kontext or Nano-banana-like reference-image based image editing models, I think these can also achieve good result in generating go
Taxonomy
Topics: Video Analysis and Summarization · Data Visualization and Analytics · Artificial Intelligence in Games
Methods: Diffusion
