ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation
Ayushman Sarkar, Zhenyu Yu, Chu Chen, Wei Tang, Kangning Cui, Mohd Yamani Idna Idris

TL;DR
ReDiStory is a training-free method for visual story generation that reorganizes prompt embeddings at inference time, reducing cross-frame semantic interference so that subject identity is better preserved across the images of a story.
Contribution
The paper introduces an inference-time prompt reorganization technique that explicitly decomposes text embeddings into identity-related and frame-specific components and decorrelates the frame embeddings, improving identity consistency without additional training.
Findings
Improves identity consistency metrics on the ConsiStory+ benchmark.
Maintains prompt fidelity while reducing cross-frame interference.
Operates without modifying diffusion models or requiring extra supervision.
Abstract
Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while…
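The decorrelation step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the "shared direction" across frames is approximated by the mean of the frame embeddings, and that suppression means projecting that direction out of each frame embedding. The function names and the `strength` parameter are hypothetical, and plain Python lists stand in for real text-encoder embeddings.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is zero."""
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom > 0 else 0.0


def suppress_shared_direction(frame_embs, strength=1.0):
    """Remove the component of each frame embedding along the cross-frame mean.

    Assumption (not from the paper): the shared direction is the mean of the
    frame embeddings. `strength` in [0, 1] is a hypothetical knob; 1.0 removes
    the shared component entirely, 0.0 leaves the embeddings unchanged.
    """
    n = len(frame_embs)
    dim = len(frame_embs[0])
    mean = [sum(e[i] for e in frame_embs) / n for i in range(dim)]
    mean_sq = dot(mean, mean)
    if mean_sq == 0:
        # No shared direction to suppress; return copies unchanged.
        return [list(e) for e in frame_embs]
    out = []
    for e in frame_embs:
        scale = strength * dot(e, mean) / mean_sq
        out.append([a - scale * m for a, m in zip(e, mean)])
    return out


# Toy frame-specific embeddings that overlap heavily in the last coordinate.
frames = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
decorrelated = suppress_shared_direction(frames)

# Compare similarity between the first two frames before and after.
before = cosine(frames[0], frames[1])
after = cosine(decorrelated[0], decorrelated[1])
print(before, after)
```

In this toy setup the identity-related component would be kept separately and left untouched; only the frame-specific vectors are decorrelated, which mirrors the decomposition the abstract describes at a high level.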
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
