Text-Only Training for Visual Storytelling

Yuechen Wang; Wengang Zhou; Zhenbo Lu; Houqiang Li

arXiv:2308.08881 · cs.CV · August 21, 2023

TL;DR

This paper introduces a novel text-only training approach for visual storytelling that leverages a cross-modality pre-trained model and a visual condition planner, enabling effective story generation from image sequences without requiring paired image-text data.

Contribution

It proposes a new method that trains visual storytelling models solely on text data, separating cross-modality alignment from story generation, and uses a visual condition planner for temporal structure understanding.

Findings

1. Outperforms existing methods on the VIST benchmark

2. Enhances generalization to cross-domain scenarios

3. Improves diversity and human-rated quality of generated stories

Abstract

Visual storytelling aims to generate a narrative based on a sequence of images, necessitating both vision-language alignment and coherent story generation. Most existing solutions predominantly depend on paired image-text training data, which can be costly to collect and challenging to scale. To address this, we formulate visual storytelling as a visual-conditioned story generation problem and propose a text-only training method that separates the learning of cross-modality alignment and story generation. Our approach specifically leverages the cross-modality pre-trained CLIP model to integrate visual control into a story generator, trained exclusively on text data. Moreover, we devise a training-free visual condition planner that accounts for the temporal structure of the input image sequence while balancing global and local visual content. The distinctive advantage of requiring only…
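The abstract describes a training-free planner that balances global and local visual content across the image sequence, but it does not give the mechanism. As a rough illustration only, the sketch below blends each image's embedding (local) with the mean embedding of the whole sequence (global). The function names `l2_normalize` and `plan_conditions`, the weight `alpha`, and the linear blend itself are assumptions made for this example, not the authors' planner; the inputs stand in for CLIP image embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def plan_conditions(image_embeddings, alpha=0.5):
    """Hypothetical planner: one condition vector per image, in input order.

    alpha = 1.0 keeps only the local (per-image) content;
    alpha = 0.0 keeps only the global (sequence-mean) content.
    This is an illustrative assumption, not the method from the paper.
    """
    if not image_embeddings:
        return []
    units = [l2_normalize(e) for e in image_embeddings]
    dim = len(units[0])
    # Global content: mean of the normalized per-image embeddings.
    global_vec = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    conditions = []
    for local_vec in units:
        mixed = [alpha * l + (1 - alpha) * g
                 for l, g in zip(local_vec, global_vec)]
        conditions.append(l2_normalize(mixed))
    return conditions


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real image embeddings.
    toy_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for step, cond in enumerate(plan_conditions(toy_sequence, alpha=0.7)):
        print(step, [round(c, 3) for c in cond])
```

Because the output keeps one vector per image in the original order, a downstream generator could consume the conditions sentence by sentence; how the actual method encodes temporal structure is not stated in the text above.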

Taxonomy

Topics: Multimodal Machine Learning Applications · Digital Storytelling and Education · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training