Text-Only Training for Visual Storytelling
Yuechen Wang, Wengang Zhou, Zhenbo Lu, Houqiang Li

TL;DR
This paper introduces a text-only training approach for visual storytelling that combines a cross-modality pre-trained model (CLIP) with a visual condition planner, so that stories can be generated from image sequences without paired image-text training data.
Contribution
The method trains the story generator solely on text data, separates the learning of cross-modality alignment from story generation, and uses a training-free visual condition planner to account for the temporal structure of the input image sequence.
Findings
Outperforms existing methods on the VIST benchmark
Enhances generalization to cross-domain scenarios
Improves diversity and human-rated quality of generated stories
Abstract
Visual storytelling aims to generate a narrative based on a sequence of images, necessitating both vision-language alignment and coherent story generation. Most existing solutions predominantly depend on paired image-text training data, which can be costly to collect and challenging to scale. To address this, we formulate visual storytelling as a visual-conditioned story generation problem and propose a text-only training method that separates the learning of cross-modality alignment and story generation. Our approach specifically leverages the cross-modality pre-trained CLIP model to integrate visual control into a story generator, trained exclusively on text data. Moreover, we devise a training-free visual condition planner that accounts for the temporal structure of the input image sequence while balancing global and local visual content. The distinctive advantage of requiring only…
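The abstract says the visual condition planner balances global and local visual content across the image sequence, but this excerpt does not give its formulation. The sketch below is only one way such a balance could look, under stated assumptions: each image is already encoded as an embedding vector (for example by CLIP's image encoder), the global content is taken to be the mean embedding of the sequence, and a hypothetical weight `alpha` trades local against global content. The function names and the blending rule are illustrative and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def plan_conditions(image_embeddings, alpha=0.5):
    """Hypothetical planner: return one condition vector per image.

    alpha weights the local (per-image) embedding and (1 - alpha) weights
    the global embedding, assumed here to be the mean over the sequence.
    This blending rule is an assumption, not the paper's planner.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not image_embeddings:
        return []

    normalized = [l2_normalize(e) for e in image_embeddings]
    dim = len(normalized[0])
    count = len(normalized)

    # Global content: normalized mean of all image embeddings.
    global_vec = l2_normalize(
        [sum(e[i] for e in normalized) / count for i in range(dim)]
    )

    conditions = []
    for local in normalized:  # sequence order is preserved
        blended = [alpha * l + (1.0 - alpha) * g for l, g in zip(local, global_vec)]
        conditions.append(l2_normalize(blended))
    return conditions
```

With `alpha=1.0` each condition reduces to that image's own embedding, and with `alpha=0.0` every step receives the same sequence-level vector; intermediate values mix the two. The output keeps the input order, so a downstream generator could consume one condition per sentence.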
Taxonomy
Topics: Multimodal Machine Learning Applications · Digital Storytelling and Education · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
