Reading Between the Lines: Exploring Infilling in Visual Narratives
Khyathi Raghavi Chandu, Ruo-Ping Dong, Alan Black

TL;DR
This paper introduces an infilling approach for generating coherent visual narratives from image sequences, supported by a new large-scale dataset; the authors report improved storytelling quality over existing methods.
Contribution
The paper presents an infilling technique for visual narrative generation and introduces the visual procedure telling (ViPT) dataset, with 46,200 procedures and around 340k image-text pairs.
Findings
Achieved a METEOR score of 27.51, surpassing the state of the art in visual storytelling.
Demonstrated improved coherence in generated narratives using infilling.
Showcased the effectiveness of infilling in handling missing steps and images.
Abstract
Generating long-form narratives such as stories and procedures from multiple modalities has been a long-standing dream for artificial intelligence. In this regard, there is often crucial subtext that is derived from the surrounding contexts. The general seq2seq training methods render the models shorthanded while attempting to bridge the gap between these neighbouring contexts. In this paper, we tackle this problem by using infilling techniques involving prediction of missing steps in a narrative while generating textual descriptions from a sequence of images. We also present a new large-scale visual procedure telling (ViPT) dataset with a total of 46,200 procedures and around 340k pairwise images and textual descriptions that is rich in such contextual dependencies. Generating steps using infilling technique demonstrates the effectiveness in visual procedures with…
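As a rough illustration of the infilling idea described in the abstract, the sketch below hides one or more steps of an image-feature sequence so that a model would have to generate text for those steps from the neighbouring context. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mask_steps`, the use of zero vectors as the mask, and the random choice of which steps to hide are all hypothetical choices made for this example.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mask_steps(
    features: Sequence[Sequence[float]],
    num_masked: int = 1,
    rng: Optional[random.Random] = None,
) -> Tuple[List[List[float]], List[int]]:
    """Hide `num_masked` steps of a per-step feature sequence.

    Assumption: a hidden step is represented by an all-zero vector.
    Returns the masked copy and the sorted indices that were hidden.
    """
    if rng is None:
        rng = random.Random()
    n = len(features)
    if not 0 <= num_masked <= n:
        raise ValueError("num_masked must be between 0 and the number of steps")

    # Pick which steps to hide (hypothetical: uniformly at random).
    masked_indices = sorted(rng.sample(range(n), num_masked))
    hidden = set(masked_indices)

    # Copy every step so the caller's input is left untouched.
    masked = [
        [0.0] * len(vec) if i in hidden else list(vec)
        for i, vec in enumerate(features)
    ]
    return masked, masked_indices


if __name__ == "__main__":
    # Three steps with toy 2-dimensional "image features".
    steps = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    masked, idx = mask_steps(steps, num_masked=1, rng=random.Random(0))
    print("hidden step indices:", idx)
    print("model input:", masked)
```

In a training setup along these lines, the masked sequence would be the model input while the target would still contain the text for every step, including the hidden ones, so the model learns to fill in the missing step from its neighbours.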
Taxonomy
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence
