A Hierarchical Approach for Visual Storytelling Using Image Description
Md Sultan Al Nahian, Tasmia Tasrin, Sagar Gandhi, Ryan Gaines, and Brent Harrison

TL;DR
This paper introduces a hierarchical deep learning model that uses image descriptions and images to generate coherent, long visual stories, outperforming existing methods on the VIST dataset.
Contribution
A novel hierarchical encoder-decoder architecture incorporating image descriptions to improve long-term context and diversity in visual storytelling.
Findings
Outperforms state-of-the-art techniques on the VIST dataset across a suite of automatic evaluation metrics
Demonstrates the importance of the hierarchical structure
Validates the effectiveness of image descriptions in storytelling
Abstract
One of the primary challenges of visual storytelling is developing techniques that can maintain the context of the story over long event sequences to generate human-like stories. In this paper, we propose a hierarchical deep learning architecture based on encoder-decoder networks to address this problem. To help our network maintain this context while also generating long and diverse sentences, we incorporate natural language image descriptions along with the images themselves to generate each story sentence. We evaluate our system on the Visual Storytelling (VIST) dataset and show that our method outperforms state-of-the-art techniques on a suite of different automatic evaluation metrics. The empirical results from this evaluation demonstrate the necessity of the different components of our proposed architecture and show the effectiveness of the architecture for visual…
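The two-level data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the function names (`fuse`, `story_context`, `encode_story`), the concatenation step, and the exponential moving average standing in for a learned recurrent encoder are all assumptions made for illustration. It only shows how a per-image step (image features plus description embedding) can feed a story-level step that carries context across the sequence.

```python
from typing import List

Vector = List[float]


def fuse(image_feat: Vector, desc_emb: Vector) -> Vector:
    """Lower level (hypothetical): join one image's features with the
    embedding of its text description by simple concatenation."""
    return image_feat + desc_emb


def story_context(fused_steps: List[Vector], decay: float = 0.5) -> List[Vector]:
    """Upper level (hypothetical): carry a running context over the image
    sequence. An exponential moving average stands in for a learned
    recurrent encoder, which this sketch does not implement."""
    if not fused_steps:
        return []
    context = [0.0] * len(fused_steps[0])
    contexts = []
    for step in fused_steps:
        # Blend the previous context with the current fused input.
        context = [decay * c + (1.0 - decay) * x for c, x in zip(context, step)]
        contexts.append(context)
    return contexts


def encode_story(image_feats: List[Vector], desc_embs: List[Vector],
                 decay: float = 0.5) -> List[Vector]:
    """Return one context vector per image; a sentence decoder (not shown)
    would condition on each of these to produce one story sentence."""
    if len(image_feats) != len(desc_embs):
        raise ValueError("need exactly one description embedding per image")
    fused = [fuse(img, desc) for img, desc in zip(image_feats, desc_embs)]
    return story_context(fused, decay)
```

In this sketch, each later context vector still contains a decayed trace of earlier images and descriptions, which is the property a hierarchical encoder is meant to provide when generating sentences later in the story.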
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
