Vision Transformer Based Model for Describing a Set of Images as a Story

Zainy M. Malakan; Ghulam Mubashar Hassan; Ajmal Mian

arXiv:2210.02762 · cs.CV · July 17, 2023

TL;DR

This paper introduces a novel Vision Transformer-based model for visual storytelling that effectively captures visual variation and context from image sets, resulting in more coherent and relevant stories.

Contribution

The model combines a Vision Transformer with sequence encoding and attention mechanisms to generate stories from image sets, and it outperforms existing visual storytelling models.

Findings

01 Outperforms current state-of-the-art models on the VIST dataset

02 Effectively captures visual variation and context in image sets

03 Enhances story coherence and relevance

Abstract

Visual storytelling is the process of forming a multi-sentence story from a set of images. Appropriately including the visual variation and contextual information captured in the input images is one of its most challenging aspects. Consequently, stories developed from a set of images often lack cohesiveness, relevance, and semantic relationships. In this paper, we propose a novel Vision Transformer based model for describing a set of images as a story. The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT). First, each input image is divided into 16×16 patches, which are flattened and passed through a linear projection. The transformation from a single image to multiple image patches captures the visual variety of the input visual patterns. These features are used as input to a Bidirectional-LSTM which is part of the…
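
The patch step described in the abstract can be sketched in a few lines of NumPy. This is an illustration only, not the authors' implementation: the 224×224 input size, the 768-dimensional embedding, the random projection matrix standing in for learned weights, and the helper name `patchify` are all assumptions that do not appear in the text above.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C),
    with patches ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)


rng = np.random.default_rng(0)

# Assumed input size: 224 x 224 RGB (not stated in the abstract).
image = rng.random((224, 224, 3))

# 14 x 14 = 196 patches, each flattened to 16 * 16 * 3 = 768 values.
patches = patchify(image, patch=16)

# Hypothetical embedding width of 768; random weights stand in for the
# learned linear projection of a real ViT.
embed_dim = 768
projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))

# One embedding vector ("token") per patch.
tokens = patches @ projection
```

Each row of `tokens` corresponds to one image patch. In a full ViT these rows would also receive position embeddings and pass through transformer encoder layers before any sequence model sees them; that part is omitted here.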

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Label Smoothing · Softmax · Byte Pair Encoding · Adam · Vision Transformer · Dense Connections