StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization
Gopalji Gaur, Mohammadreza Zolfaghari, Thomas Brox

TL;DR
StorySync offers a training-free method for maintaining subject consistency in text-to-image generation, using region harmonization and attention sharing to produce coherent visual stories without retraining models.
Contribution
StorySync introduces a training-free approach that combines masked cross-image attention sharing with Regional Feature Harmonization to keep subjects consistent across images generated by pre-trained diffusion models.
Findings
Achieves consistent subjects across story scenes
Maintains creative diversity of generated images
Operates efficiently without model fine-tuning
Abstract
Generating a coherent sequence of images that tells a visual story, using text-to-image diffusion models, often faces the critical challenge of maintaining subject consistency across all story scenes. Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model's pre-existing capabilities. In this paper, we follow a training-free approach and propose an efficient consistent-subject-generation method. This approach works seamlessly with pre-trained diffusion models by introducing masked cross-image attention sharing to dynamically align subject features across a batch of images, and Regional Feature Harmonization to refine visually similar details for improved subject consistency. Experimental results demonstrate that our approach successfully generates visually consistent subjects across…
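To make the idea of masked cross-image attention sharing more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes one plausible reading of the abstract: each image's queries attend to all of that image's own tokens, plus only the subject-region tokens of the other images in the batch. The function name `masked_cross_image_attention`, the tensor shapes, and the boolean `subject_mask` input are all hypothetical, and the sketch does not cover how the masks are obtained or how Regional Feature Harmonization works.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_cross_image_attention(q, k, v, subject_mask):
    """Hypothetical sketch of attention sharing across a batch of images.

    q, k, v:       (B, N, D) per-image token features.
    subject_mask:  (B, N) boolean, True where a token lies on the subject
                   (assumed to be given; not derived here).
    Returns:       (B, N, D) attended features.
    """
    B, N, D = q.shape

    # Pool keys and values from every image in the batch: (B*N, D).
    k_all = k.reshape(B * N, D)
    v_all = v.reshape(B * N, D)

    # allowed[i, j*N + n] is True if image i may attend to token n of image j.
    # Start from "subject tokens of any image are visible to everyone".
    allowed = np.broadcast_to(subject_mask.reshape(1, B * N), (B, B * N)).copy()
    for i in range(B):
        # An image always sees all of its own tokens (ordinary self-attention).
        allowed[i, i * N:(i + 1) * N] = True

    # Scaled dot-product scores against the pooled keys: (B, N, B*N).
    scores = np.einsum("bnd,md->bnm", q, k_all) / np.sqrt(D)
    scores = np.where(allowed[:, None, :], scores, -np.inf)

    weights = softmax(scores, axis=-1)
    return np.einsum("bnm,md->bnd", weights, v_all)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(2, 4, 8)) for _ in range(3))
    mask = np.array([[True, True, False, False],
                     [False, True, True, False]])
    print(masked_cross_image_attention(q, k, v, mask).shape)
```

With an all-False mask this reduces to ordinary per-image self-attention, which is one way to see that the sharing only adds subject tokens from other images and leaves the rest of each image's attention untouched.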
