Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) -- Team: MMCUniAugsburg

Philipp Harzig; Moritz Einfalt; Katja Ludwig; Rainer Lienhart

arXiv:2112.14100 · cs.CV · December 30, 2021

Open Access

TL;DR

This paper presents an improved Video-to-Text system using Transformer architectures and self-critical training, demonstrating significant performance gains over traditional image captioning pipelines for video captioning tasks.

Contribution

The authors adapt the Transformer and X-Linear Attention Networks for video captioning and finetune both models with self-critical sequence training, which improves validation performance on the VTT data. The X-Linear variant does not yield the score gain the authors had hoped for.

Findings

01 Transformer-based models outperform traditional image captioning pipelines.

02 Self-critical training significantly boosts validation performance.

03 Transformer architecture yields captions that better match videos.

Abstract

The Multimedia and Computer Vision Lab of the University of Augsburg participated in the VTT task only. We use the VATEX and TRECVID-VTT datasets for training our VTT models. We base our model on the Transformer approach for both of our submitted runs. For our second model, we adapt the X-Linear Attention Networks for Image Captioning, which does not yield the desired bump in scores. For both models, we pretrain on the complete VATEX dataset and 90% of the TRECVID-VTT dataset, using the remaining 10% for validation. We finetune both models with self-critical sequence training, which boosts the validation performance significantly. Overall, we find that training a Video-to-Text system on traditional Image Captioning pipelines delivers very poor performance. When switching to a Transformer-based architecture, our results greatly improve and the generated captions match…
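The self-critical finetuning step mentioned in the abstract can be illustrated with a minimal sketch of the loss. This is not the authors' code: it assumes the usual self-critical formulation, where the reward of a sampled caption is compared against the reward of the greedy-decoded caption for the same video, and the function name `scst_loss` and its arguments are hypothetical. The reward metric is also an assumption (CIDEr is a common choice); the abstract does not say which one was used.

```python
from typing import Sequence


def scst_loss(
    sample_logprobs: Sequence[float],
    sample_rewards: Sequence[float],
    greedy_rewards: Sequence[float],
) -> float:
    """Mean self-critical policy-gradient loss over a batch of captions.

    sample_logprobs: summed token log-probabilities of each sampled caption.
    sample_rewards:  reward of each sampled caption (assumed metric, e.g. CIDEr).
    greedy_rewards:  reward of the greedy-decoded caption, used as the baseline.
    """
    n = len(sample_logprobs)
    if n == 0:
        raise ValueError("batch must not be empty")
    if len(sample_rewards) != n or len(greedy_rewards) != n:
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for logp, r_sample, r_greedy in zip(sample_logprobs, sample_rewards, greedy_rewards):
        # Positive advantage: the sampled caption beat the greedy baseline,
        # so minimizing the loss raises its log-probability.
        advantage = r_sample - r_greedy
        total += -advantage * logp
    return total / n


if __name__ == "__main__":
    # Two hypothetical captions: the first beats its baseline, the second does not.
    print(scst_loss([-2.0, -4.0], [0.8, 0.4], [0.5, 0.6]))
```

In an actual training loop the log-probabilities would come from the captioning model and carry gradients, while both rewards are treated as constants. The sketch only shows how the baseline subtraction shapes the sign and size of each caption's contribution.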


Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Dropout · Softmax · Position-Wise Feed-Forward Layer · Adam