Video captioning with stacked attention and semantic hard pull

Md. Mushfiqur Rahman; Thasin Abedin; Khondokar S. S. Prottoy; Ayana; Moshruba; Fazlul Hasan Siddiqui

arXiv:2009.07335 · cs.CV · October 17, 2023 · 1 cites

TL;DR

This paper introduces a novel video captioning architecture called SSVC that employs stacked attention and spatial hard pull to improve semantic accuracy, validated through both quantitative and qualitative evaluations.

Contribution

The paper proposes the SSVC model with innovative stacked attention and spatial hard pull mechanisms for enhanced semantic video captioning.

Findings

01 Improved BLEU scores over state-of-the-art models
02 Higher Semantic Sensibility (SS) scores in human evaluations
03 Effective combination of attention and hard pull techniques

Abstract

Video captioning, i.e., the task of generating captions from video sequences, creates a bridge between the Natural Language Processing and Computer Vision domains of computer science. The task of generating a semantically accurate description of a video is quite complex. Considering the complexity of the problem, the results obtained in recent research works are praiseworthy. However, there is plenty of scope for further investigation. This paper addresses this scope and proposes a novel solution. Most video captioning models comprise two sequential/recurrent layers: one as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, namely Semantically Sensible Video Captioning (SSVC), which modifies the context generation mechanism by using two novel approaches: "stacked attention" and "spatial hard pull". As there are no…
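To make the "context generation" step in an encoder-decoder captioner more concrete, the sketch below computes one context vector from per-frame encoder outputs using plain dot-product soft attention. It is a generic illustration only, not SSVC's stacked attention or spatial hard pull, whose details are not given here. The names `attend`, `frame_features`, and `decoder_state`, and the toy vectors, are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frame_features, decoder_state):
    """Return (weights, context) for a single decoding step.

    frame_features: list of per-frame encoder vectors (hypothetical input).
    decoder_state: current decoder hidden vector of the same length.
    """
    # Dot-product relevance of each frame to the current decoder state.
    scores = [
        sum(f * d for f, d in zip(frame, decoder_state))
        for frame in frame_features
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    # The context is the attention-weighted average of the frame vectors.
    context = [
        sum(w * frame[i] for w, frame in zip(weights, frame_features))
        for i in range(dim)
    ]
    return weights, context


# Toy data: three frames with 2-dimensional features (made up for illustration).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(frames, [2.0, 0.0])
```

In a typical captioner of this kind, the decoder would consume `context` together with the previous word at each step. Frames that align with the decoder state (the first and third here) receive most of the weight.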

Code & Models

Repositories

mushfiqur11/SS-VideoCaptioning
tf · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization