StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation
Adyasha Maharana, Darryl Hannan, and Mohit Bansal

TL;DR
This paper introduces StoryDALL-E, a method to adapt pretrained text-to-image transformers for story continuation tasks, enabling better generalization to new narratives and characters through task-specific modules and fine-tuning strategies.
Contribution
The paper proposes a novel approach to adapt pretrained text-to-image models for story continuation, including task-specific modules and evaluation on multiple datasets, outperforming GAN-based models.
Findings
Retro-fitting improves story continuity and element copying.
Pretrained transformers struggle with multi-character narratives.
Fine-tuning enhances model performance on story datasets.
Abstract
Recent advances in text-to-image synthesis have led to large pretrained transformers with excellent capabilities to generate visualizations from a given text. However, these models are ill-suited for specialized tasks like story visualization, which requires an agent to produce a sequence of images given a corresponding sequence of captions, forming a narrative. Moreover, we find that the story visualization task fails to accommodate generalization to unseen plots and characters in new narratives. Hence, we first propose the task of story continuation, where the generated visual story is conditioned on a source image, allowing for better generalization to narratives with new characters. Then, we enhance or 'retro-fit' the pretrained text-to-image synthesis models with task-specific modules for (a) sequential image generation and (b) copying relevant elements from an initial frame. Then,…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation
