Diverse Video Captioning by Adaptive Spatio-temporal Attention
Zohreh Ghaderi, Leonard Salewski, Hendrik P. A. Lensch

TL;DR
This paper presents a novel end-to-end video captioning framework using adaptive spatio-temporal attention with transformers, achieving state-of-the-art results and diverse captions on multiple benchmarks.
Contribution
The paper introduces an adaptive frame selection scheme and a combined transformer architecture for improved video understanding and caption generation.
Findings
Achieves state-of-the-art results on MSVD, MSR-VTT, and VATEX datasets.
Demonstrates improved diversity and expressiveness in generated captions.
Reduces computational load with adaptive frame selection.
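The frame selection idea behind the last finding can be illustrated with a small sketch. This is not the authors' implementation: the distance measure (mean absolute difference between per-frame feature vectors), the `threshold` value, and the names `mean_abs_diff` and `select_frames` are all assumptions chosen for illustration. The sketch keeps a frame only when it differs enough from the most recently kept frame, so static stretches of a clip contribute fewer frames.

```python
from typing import List, Optional, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature vectors.

    Assumption: a stand-in for whatever frame similarity measure is used
    in practice; the summary above does not specify one.
    """
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_frames(
    frames: List[Sequence[float]],
    threshold: float = 0.1,            # hypothetical value
    max_frames: Optional[int] = None,  # optional cap on selected frames
) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if max_frames is not None and len(kept) >= max_frames:
            break
        # Compare against the most recently kept frame, not the previous one,
        # so slow drift eventually triggers a new selection.
        if mean_abs_diff(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept


# Toy clip: five "frames", each a 4-dimensional feature vector.
clip = [[0.0] * 4, [0.01] * 4, [0.5] * 4, [0.52] * 4, [1.0] * 4]
selected = select_frames(clip, threshold=0.1)
```

In this toy clip the near-duplicate frames are skipped, so fewer frames would reach the encoder. A real pipeline would operate on image tensors rather than short lists.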
Abstract
To generate proper captions for videos, the inference needs to identify relevant concepts and pay attention to the spatial relationships between them as well as to the temporal development in the clip. Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures, an adapted transformer for a single joint spatio-temporal video analysis as well as a self-attention-based decoder for advanced text generation. Furthermore, we introduce an adaptive frame selection scheme to reduce the number of required incoming frames while maintaining the relevant content when training both transformers. Additionally, we estimate semantic concepts relevant for video captioning by aggregating all ground truth captions of each sample. Our approach achieves state-of-the-art results on the MSVD, as well as on the large-scale MSR-VTT and the VATEX benchmark datasets…
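The abstract's step of estimating semantic concepts by aggregating all ground truth captions of a sample can be sketched roughly as follows. The tokenization, the small `STOPWORDS` set, the `min_captions` cutoff, and the function name `estimate_concepts` are assumptions made for this illustration; the abstract does not say how the authors perform the aggregation.

```python
import re
from collections import Counter
from typing import Iterable, List

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "and", "with", "to"}


def estimate_concepts(captions: Iterable[str], min_captions: int = 2) -> List[str]:
    """Return words that occur in at least `min_captions` captions of one video.

    Each caption contributes a word at most once, so the count reflects how
    many annotators mentioned the concept rather than raw term frequency.
    """
    counts: Counter = Counter()
    for caption in captions:
        tokens = set(re.findall(r"[a-z]+", caption.lower()))
        counts.update(tokens - STOPWORDS)
    return sorted(word for word, n in counts.items() if n >= min_captions)


# Toy ground-truth captions for a single video sample.
captions = [
    "A man is playing a guitar",
    "The man plays guitar on stage",
    "Someone is playing music",
]
concepts = estimate_concepts(captions)
```

Words mentioned by several annotators ("guitar", "man", "playing") survive the cutoff, while words that appear in only one caption are dropped. Such a concept set could serve as an auxiliary multi-label target, although that use is an assumption here.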
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Cancer-related molecular mechanisms research
