Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) -- Team: MMCUniAugsburg
Philipp Harzig, Moritz Einfalt, Katja Ludwig, Rainer Lienhart

TL;DR
This paper presents an improved Video-to-Text system using Transformer architectures and self-critical training, demonstrating significant performance gains over traditional image captioning pipelines for video captioning tasks.
Contribution
The authors adapt Transformer and X-Linear Attention Networks for video captioning and apply self-critical sequence training, resulting in improved captioning performance on VTT datasets.
Findings
Transformer-based models outperform traditional image captioning pipelines.
Self-critical training significantly boosts validation performance.
Transformer architecture yields captions that better match videos.
Abstract
The Multimedia and Computer Vision Lab of the University of Augsburg participated in the VTT task only. We use the VATEX and TRECVID-VTT datasets for training our VTT models. We base our model on the Transformer approach for both of our submitted runs. For our second model, we adapt the X-Linear Attention Networks for Image Captioning, which does not yield the desired bump in scores. For both models, we pretrain on the complete VATEX dataset and 90% of the TRECVID-VTT dataset, using the remaining 10% for validation. We finetune both models with self-critical sequence training, which boosts the validation performance significantly. Overall, we find that training a Video-to-Text system on traditional Image Captioning pipelines delivers very poor performance. When switching to a Transformer-based architecture, our results greatly improve and the generated captions match…
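The self-critical finetuning step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard self-critical formulation, in which the reward of a sampled caption is compared against the reward of the model's own greedy caption and the difference weights the log-likelihood of the sample. The function name `scst_loss`, its arguments, and the idea that rewards are precomputed per caption by some captioning metric are assumptions for illustration; the abstract does not specify the reward metric or any implementation details.

```python
from typing import Sequence


def scst_loss(
    sample_logprobs: Sequence[Sequence[float]],
    sample_rewards: Sequence[float],
    greedy_rewards: Sequence[float],
) -> float:
    """Mean self-critical loss over a batch (illustrative sketch only).

    sample_logprobs: per-caption list of token log-probabilities for the
        caption sampled from the model.
    sample_rewards:  reward of each sampled caption (hypothetical metric score).
    greedy_rewards:  reward of the greedy-decoded caption for the same video,
        used as the baseline.
    """
    if not (len(sample_logprobs) == len(sample_rewards) == len(greedy_rewards)):
        raise ValueError("all inputs must have the same batch size")
    if len(sample_logprobs) == 0:
        raise ValueError("batch must not be empty")

    total = 0.0
    for logps, r_sample, r_greedy in zip(
        sample_logprobs, sample_rewards, greedy_rewards
    ):
        # Positive advantage: the sample beat the greedy baseline, so
        # minimizing the loss raises the sample's log-likelihood.
        advantage = r_sample - r_greedy
        total += -advantage * sum(logps)
    return total / len(sample_logprobs)


if __name__ == "__main__":
    # One sampled caption with two tokens that scores above the greedy baseline.
    print(scst_loss([[-1.0, -1.0]], [1.0], [0.5]))
```

In a real training loop the log-probabilities would be differentiable tensors and the rewards would be treated as constants; plain floats are used here only to keep the arithmetic visible.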
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Dropout · Softmax · Position-Wise Feed-Forward Layer · Adam
