Diverse Video Captioning by Adaptive Spatio-temporal Attention

Zohreh Ghaderi; Leonard Salewski; Hendrik P. A. Lensch

arXiv:2208.09266 · cs.CV · August 22, 2022 · 1 citation

TL;DR

This paper presents a novel end-to-end video captioning framework using adaptive spatio-temporal attention with transformers, achieving state-of-the-art results and diverse captions on multiple benchmarks.

Contribution

It introduces an adaptive frame selection scheme and a combined transformer architecture for improved video understanding and caption generation.

Findings

1. Achieves state-of-the-art results on the MSVD, MSR-VTT, and VATEX datasets.
2. Demonstrates improved diversity and expressiveness in generated captions.
3. Reduces computational load with adaptive frame selection.
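
This listing does not say how frames are chosen, so the following is only a minimal sketch of the general idea of content-adaptive frame selection, not the authors' scheme. It assumes each frame is already summarized by a feature vector and uses Euclidean distance as a stand-in for visual change; the function name `select_frames` and both of those choices are hypothetical. Frames are picked at even steps of accumulated change, so static stretches contribute few frames and fast-changing stretches contribute more.

```python
from bisect import bisect_left
from itertools import accumulate
from math import dist


def select_frames(features, k):
    """Return at most k frame indices, spaced evenly in cumulative visual change.

    `features` is a list of per-frame feature vectors (hypothetical input;
    the distance metric below is an assumption, not the paper's).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(features)
    if k >= n:
        return list(range(n))

    # Change between consecutive frames.
    changes = [dist(a, b) for a, b in zip(features, features[1:])]
    # cumulative[i] is the total change accumulated up to frame i.
    cumulative = [0.0] + list(accumulate(changes))
    total = cumulative[-1]

    if total == 0 or k == 1:
        # Static clip or single frame requested: uniform temporal sampling.
        if k == 1:
            return [0]
        return [round(i * (n - 1) / (k - 1)) for i in range(k)]

    picked = []
    for j in range(k):
        target = total * j / (k - 1)
        # First frame whose accumulated change reaches the target.
        idx = min(bisect_left(cumulative, target), n - 1)
        if idx not in picked:
            picked.append(idx)
    return picked


if __name__ == "__main__":
    # Six one-dimensional "features": frames 0-2 are static, then two jumps.
    clip = [[0.0], [0.0], [0.0], [10.0], [10.0], [20.0]]
    print(select_frames(clip, 3))
```

Because duplicate indices are dropped, the function can return fewer than `k` frames when the change is concentrated in a few transitions.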

Abstract

To generate proper captions for videos, the inference needs to identify relevant concepts and pay attention to the spatial relationships between them as well as to the temporal development in the clip. Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures, an adapted transformer for a single joint spatio-temporal video analysis as well as a self-attention-based decoder for advanced text generation. Furthermore, we introduce an adaptive frame selection scheme to reduce the number of required incoming frames while maintaining the relevant content when training both transformers. Additionally, we estimate semantic concepts relevant for video captioning by aggregating all ground truth captions of each sample. Our approach achieves state-of-the-art results on the MSVD, as well as on the large-scale MSR-VTT and the VATEX benchmark datasets…
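
The abstract says semantic concepts are estimated by aggregating all ground truth captions of a sample, without giving the details here. One plausible reading, sketched below under stated assumptions, is to keep the words that recur across several of a video's reference captions. The function name `estimate_concepts`, the small stopword list, the regex tokenizer, and the `min_captions` threshold are all hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real setup would use a fuller one).
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to", "with"}
)


def estimate_concepts(captions, min_captions=2):
    """Return words that appear in at least `min_captions` of the given captions."""
    counts = Counter()
    for caption in captions:
        # Count each word once per caption so one long caption cannot dominate.
        tokens = set(re.findall(r"[a-z]+", caption.lower())) - STOPWORDS
        counts.update(tokens)
    return sorted(word for word, c in counts.items() if c >= min_captions)


if __name__ == "__main__":
    refs = [
        "A man is playing a guitar",
        "A man plays the guitar on stage",
        "Someone is playing music",
    ]
    print(estimate_concepts(refs))
```

Counting per caption rather than per token means a concept is kept only when several annotators mention it independently, which is one simple way to filter out idiosyncratic wording.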

Code & Models

Repositories

zohrehghaderi/vasta (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Cancer-related molecular mechanisms research