Variational Stacked Local Attention Networks for Diverse Video Captioning
Tonmoay Deb, Akib Sadmanee, Kishor Kumar Bhaumik, Amin Ahsan Ali, M Ashraful Amin, A K M Mahbubur Rahman

TL;DR
This paper introduces VSLAN, a novel video captioning model that enhances feature interaction and diversity in generated captions through variational stacked local attention and multiple feature streams, outperforming existing methods.
Contribution
The paper proposes VSLAN, a new model that uses low-rank bilinear pooling and feature stacking to improve caption diversity and accuracy without explicit supervision.
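As a rough illustration of the low-rank bilinear pooling idea mentioned above, the sketch below fuses two feature vectors by projecting each into a shared low-rank space, multiplying them element-wise, and projecting the result to an output space. This is a generic formulation, not the authors' implementation: the function names, the `tanh` nonlinearity, the dimensions, and the random weights are all assumptions made for the example.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Hypothetical weight initialiser: uniform values scaled by 1/sqrt(cols)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def low_rank_bilinear_pool(x, y, U, V, P):
    """Generic low-rank bilinear pooling (illustrative, not VSLAN itself).

    x, y : input feature vectors (e.g. a visual feature and a query feature)
    U, V : projections of x and y into a shared space of size `rank`
    P    : projection from the shared space to the output space
    """
    hx = [math.tanh(a) for a in matvec(U, x)]   # project x, then squash
    hy = [math.tanh(a) for a in matvec(V, y)]   # project y, then squash
    joint = [a * b for a, b in zip(hx, hy)]     # element-wise (Hadamard) product
    return matvec(P, joint)                     # map the joint code to the output


# Assumed toy dimensions, chosen only for the demonstration.
rng = random.Random(0)
dim_x, dim_y, rank, dim_out = 8, 6, 4, 3
U = random_matrix(rank, dim_x, rng)
V = random_matrix(rank, dim_y, rng)
P = random_matrix(dim_out, rank, rng)

x = [rng.gauss(0.0, 1.0) for _ in range(dim_x)]
y = [rng.gauss(0.0, 1.0) for _ in range(dim_y)]
pooled = low_rank_bilinear_pool(x, y, U, V, P)
```

The two small projections replace a full bilinear weight tensor of size `dim_x * dim_y * dim_out`, which is the usual motivation for the low-rank form.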
Findings
VSLAN outperforms existing methods in CIDEr score by 7.8% on MSVD and 4.5% on MSR-VTT.
VSLAN achieves competitive results in caption diversity metrics.
The model effectively captures fine-grained visual features for diverse caption generation.
Abstract
While describing spatio-temporal events in natural language, video captioning models mostly rely on the encoder's latent visual representation. Recent progress on the encoder-decoder model attends to encoder features mainly through linear interaction with the decoder. However, growing model complexity for visual data encourages more explicit feature interaction for fine-grained information, which is currently absent in the video captioning domain. Moreover, feature aggregation methods have been used to unveil richer visual representations, either by concatenation or by using a linear layer. Though feature sets for a video semantically overlap to some extent, these approaches result in objective mismatch and feature redundancy. In addition, diversity in captions is a fundamental component of expressing one event from several meaningful perspectives, currently missing in the temporal, i.e.,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
