TL;DR
This paper improves story visualization, the task of generating a sequence of images from the captions of a story. It adds a dual learning framework, a copy-transform mechanism, and MART-based transformers to prior recurrent generative models, targeting better visual quality, coherence, and relevance, and it also proposes new evaluation metrics.
Contribution
It proposes methods for story visualization that target semantic alignment between the story and the generated images, sequential consistency across frames, and complex interactions between frames, along with new evaluation metrics.
Findings
Generated image sequences show improved visual coherence and relevance to the story.
The proposed evaluation metrics correlate with human judgment.
Ablations isolate the impact of each proposed technique.
Abstract
Story visualization is an under-explored task that falls at the intersection of many important research directions in both computer vision and natural language processing. In this task, given a series of natural language captions which compose a story, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task. However, there is room for improvement of generated images in terms of visual quality, coherence and relevance. We present a number of improvements to prior modeling approaches, including (1) the addition of a dual learning framework that utilizes video captioning to reinforce the semantic alignment between the story and generated images, (2) a copy-transform mechanism for sequentially-consistent story visualization, and (3) MART-based transformers…
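The dual learning idea in (1) can be illustrated with a minimal sketch: the generated frames are passed to a captioner, and the captioner's loss on the original story captions is added to the generator's own loss. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the use of mean token-level cross-entropy, and the `dual_weight` coefficient are all hypothetical.

```python
import math
from typing import Sequence


def caption_cross_entropy(token_probs: Sequence[Sequence[float]]) -> float:
    """Mean negative log-likelihood of the ground-truth caption tokens.

    token_probs[t][k] is the probability that a (hypothetical) video
    captioner, looking at the generated frames, assigns to the k-th
    ground-truth token of caption t.
    """
    nll = [-math.log(p) for caption in token_probs for p in caption]
    return sum(nll) / len(nll)


def dual_learning_loss(
    generator_loss: float,
    token_probs: Sequence[Sequence[float]],
    dual_weight: float = 1.0,  # assumed weighting, not from the paper
) -> float:
    """Generator loss plus a weighted captioning loss on the generated frames."""
    return generator_loss + dual_weight * caption_cross_entropy(token_probs)


# Frames from which the captions are easy to recover are penalized less.
aligned = dual_learning_loss(2.0, [[0.9, 0.8], [0.95]])
misaligned = dual_learning_loss(2.0, [[0.2, 0.1], [0.3]])
```

Under these assumptions, `aligned` is smaller than `misaligned`, so the extra term pushes the generator toward frames from which the story can be recovered.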
