Vision Transformer Based Model for Describing a Set of Images as a Story
Zainy M. Malakan, Ghulam Mubashar Hassan, Ajmal Mian

TL;DR
This paper introduces a novel Vision Transformer-based model for visual storytelling that effectively captures visual variation and context from image sets, resulting in more coherent and relevant stories.
Contribution
The model combines a Vision Transformer with sequence encoding and attention mechanisms to generate stories from image sets, and is reported to outperform existing models on the VIST dataset.
Findings
Outperforms current state-of-the-art models on the VIST dataset
Effectively captures visual variation and context in image sets
Enhances story coherence and relevance
Abstract
Visual storytelling is the process of forming a multi-sentence story from a set of images. Appropriately including the visual variation and contextual information captured in the input images is one of its most challenging aspects. Consequently, stories developed from a set of images often lack cohesiveness, relevance, and semantic relationships. In this paper, we propose a novel Vision Transformer-based model for describing a set of images as a story. The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT). First, each input image is divided into 16×16 patches, which are flattened and passed through a linear projection. The transformation from a single image to multiple image patches captures the visual variety of the input visual patterns. These features are used as input to a Bidirectional-LSTM which is part of the…
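The patch step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `image_to_patches` and `project_patches`, the 224×224 RGB input size, the 768-dimensional embedding, and the randomly initialised projection weights are all assumptions made for illustration, and "16×16" is read here as the patch size in pixels, as in the standard ViT setup.

```python
import numpy as np


def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Expose the patch grid as separate axes: (rows, ps, cols, ps, C).
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Bring the two grid axes together: (rows, cols, ps, ps, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into a single vector.
    return patches.reshape(rows * cols, patch_size * patch_size * c)


def project_patches(patches, embed_dim=768, seed=0):
    """Apply a linear projection to the flattened patches.

    The weights are random placeholders (an assumption); in a trained ViT
    they are learned parameters.
    """
    rng = np.random.default_rng(seed)
    weight = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
    bias = np.zeros(embed_dim)
    return patches @ weight + bias


# Hypothetical 224x224 RGB input: 14 x 14 = 196 patches of 16 * 16 * 3 = 768 values.
image = np.zeros((224, 224, 3), dtype=np.float32)
patches = image_to_patches(image)
tokens = project_patches(patches)
print(patches.shape, tokens.shape)  # (196, 768) (196, 768)
```

In this sketch each image becomes a sequence of 196 patch embeddings. How those embeddings are pooled into per-image features before the sequence encoder is not specified in the text above, so it is left out.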
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Label Smoothing · Softmax · Byte Pair Encoding · Adam · Vision Transformer · Dense Connections
