Variational Stacked Local Attention Networks for Diverse Video Captioning

Tonmoay Deb; Akib Sadmanee; Kishor Kumar Bhaumik; Amin Ahsan Ali; M. Ashraful Amin; A K M Mahbubur Rahman

arXiv:2201.00985·cs.CV·January 5, 2022

TL;DR

This paper introduces VSLAN, a video captioning model that uses variational stacked local attention over multiple feature streams to strengthen feature interaction and increase caption diversity, and reports higher CIDEr scores than existing methods on MSVD and MSR-VTT.

Contribution

The paper proposes VSLAN, a new model that uses low-rank bilinear pooling and feature stacking to improve caption diversity and accuracy without explicit supervision.
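
For readers unfamiliar with the term, the following is a minimal NumPy sketch of generic low-rank bilinear pooling, not the paper's implementation. It assumes the common factorized form z = P^T (f(U^T x) * f(V^T y)), where `*` is the element-wise product; the function name, matrix names, dimensions, and the choice of `tanh` are all illustrative assumptions and do not come from the paper.

```python
import numpy as np

def low_rank_bilinear_pool(x, y, U, V, P, activation=np.tanh):
    """Fuse two feature vectors with a rank-limited bilinear interaction.

    Hypothetical helper for illustration only (not from the VSLAN paper).
    x: (dx,), y: (dy,), U: (dx, r), V: (dy, r), P: (r, out).
    """
    hx = activation(U.T @ x)   # project x into the shared rank-r space
    hy = activation(V.T @ y)   # project y into the same space
    return P.T @ (hx * hy)     # element-wise product, then output projection

# Toy dimensions, chosen arbitrarily for the sketch.
rng = np.random.default_rng(0)
dx, dy, rank, out_dim = 8, 6, 4, 3
U = rng.standard_normal((dx, rank))
V = rng.standard_normal((dy, rank))
P = rng.standard_normal((rank, out_dim))
x = rng.standard_normal(dx)
y = rng.standard_normal(dy)

z = low_rank_bilinear_pool(x, y, U, V, P)

# With an identity activation, each output equals a full bilinear form
# x^T W_k y whose weight matrix W_k = U diag(P[:, k]) V^T has rank <= r.
z_linear = low_rank_bilinear_pool(x, y, U, V, P, activation=lambda t: t)
z_full = np.array([x @ (U @ np.diag(P[:, k]) @ V.T) @ y for k in range(out_dim)])
```

The factorization needs (dx + dy + out) × r parameters instead of the dx × dy × out required by a full bilinear tensor, which is the usual motivation for the low-rank form.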

Findings

1. VSLAN outperforms existing methods on CIDEr scores by 7.8% on MSVD and 4.5% on MSR-VTT.

2. VSLAN achieves competitive results in caption diversity metrics.

3. The model effectively captures fine-grained visual features for diverse caption generation.

Abstract

While describing spatio-temporal events in natural language, video captioning models mostly rely on the encoder's latent visual representation. Recent encoder-decoder models attend to encoder features mainly through linear interaction with the decoder. However, growing model complexity for visual data encourages more explicit feature interaction for fine-grained information, which is currently absent in the video captioning domain. Moreover, feature aggregation methods have been used to unveil richer visual representations, either by concatenation or by using a linear layer. Though feature sets for a video semantically overlap to some extent, these approaches result in objective mismatch and feature redundancy. In addition, diversity in captions is a fundamental component of expressing one event from several meaningful perspectives, currently missing in the temporal, i.e.,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition