ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation

Ayushman Sarkar; Zhenyu Yu; Chu Chen; Wei Tang; Kangning Cui; Mohd Yamani Idna Idris

arXiv:2602.01303·cs.CV·February 3, 2026

TL;DR

ReDiStory is a training-free method for visual story generation that reorganizes prompt embeddings at inference time, reducing cross-frame semantic interference so that subject identity is better preserved across images.

Contribution

ReDiStory introduces an inference-time prompt reorganization technique that decomposes text embeddings into identity-related and frame-specific components and decorrelates the frame embeddings, improving identity consistency without additional training.

Findings

01

Improves identity consistency metrics on the ConsiStory+ benchmark.

02

Maintains prompt fidelity while reducing cross-frame interference.

03

Operates without modifying diffusion models or requiring extra supervision.

Abstract

Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while…
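
The decorrelation step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how shared directions are estimated or how strongly they are suppressed. The sketch assumes the shared direction is the normalized mean of the frame embeddings and that suppression is a linear projection scaled by a hypothetical `strength` parameter. The function name `suppress_shared_direction` is likewise an illustrative choice, and the identity embedding is assumed to be kept separately and left untouched.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def suppress_shared_direction(frame_embs, strength=1.0):
    """Subtract from each frame embedding its component along the mean direction.

    Assumption (not from the paper): the "shared direction" is the
    normalized mean of all frame embeddings. `strength` is a hypothetical
    knob; 1.0 removes the component fully, 0.0 leaves embeddings unchanged.
    """
    n = len(frame_embs)
    dim = len(frame_embs[0])
    mean = [sum(e[i] for e in frame_embs) / n for i in range(dim)]
    norm = math.sqrt(dot(mean, mean))
    if norm == 0.0:
        # No common direction to remove; return copies unchanged.
        return [list(e) for e in frame_embs]
    shared = [m / norm for m in mean]

    decorrelated = []
    for e in frame_embs:
        coeff = dot(e, shared)  # length of e's projection onto the shared direction
        decorrelated.append([x - strength * coeff * s for x, s in zip(e, shared)])
    return decorrelated


# Toy frame-specific embeddings that overlap in their first coordinate.
frames = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
decorrelated_frames = suppress_shared_direction(frames)
```

In this toy case the two frame vectors share their first coordinate, and full suppression removes that common part, leaving only the components that distinguish the frames. A real pipeline would apply a comparable operation to text-encoder token embeddings rather than three-dimensional lists.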

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning