Imperial College London Submission to VATEX Video Captioning Task

Ozan Caglayan; Zixiu Wu; Pranava Madhyastha; Josiah Wang; Lucia Specia

arXiv:1910.07482 · cs.CL · October 17, 2019

TL;DR

This paper presents Imperial College London's approach to the VATEX video captioning challenge, comparing sequence-to-sequence models with simpler decoders conditioned on a single vector to see how each strategy affects captioning performance.

Contribution

The paper investigates how conditioning a GRU decoder on visual entity predictions compares with conditioning it on a max-pooled action feature vector for video captioning.

Findings

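01. Conditioning on entity predictions performs substantially better than conditioning on the max-pooled feature vector.

02. The sequence-to-sequence baselines (GRU and transformer) achieved scores comparable to the official baseline.

03. The entity-conditioned model is only marginally worse than the GRU-based sequence-to-sequence baseline.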

Abstract

This paper describes the Imperial College London team's submission to the 2019 VATEX video captioning challenge. We first explore two sequence-to-sequence models, namely a recurrent (GRU) model and a transformer model, which generate captions from the I3D action features. We then investigate the effect of dropping the encoder and the attention mechanism and instead conditioning the GRU decoder on two different vectorial representations: (i) a max-pooled action feature vector and (ii) the output of a multi-label classifier trained to predict visual entities from the action features. Our baselines achieved scores comparable to the official baseline. Conditioning on entity predictions performed substantially better than conditioning on the max-pooled feature vector, and only marginally worse than the GRU-based sequence-to-sequence baseline.
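To make the two conditioning vectors in the abstract concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes the entity classifier is a single linear layer with sigmoid outputs applied to the pooled features, and that the conditioning vector is used to initialise the decoder's hidden state through a tanh projection; the abstract specifies neither detail. All function names, the random weights, and the toy dimensions are hypothetical.

```python
import math
import random


def max_pool_over_time(features):
    """Element-wise max over the time axis of a T x D feature sequence."""
    return [max(column) for column in zip(*features)]


def linear(vector, weights, bias):
    """Affine map; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def entity_probabilities(vector, weights, bias):
    """Multi-label prediction: one independent sigmoid per entity label."""
    return [1.0 / (1.0 + math.exp(-z)) for z in linear(vector, weights, bias)]


def init_decoder_state(conditioning, weights, bias):
    """Project a conditioning vector to an initial decoder hidden state.

    Assumption: conditioning happens via the initial hidden state.
    """
    return [math.tanh(z) for z in linear(conditioning, weights, bias)]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D, NUM_ENTITIES, HIDDEN = 4, 6, 5, 3  # toy sizes, not the paper's

# Stand-in for a sequence of I3D action features (T time steps, D dims).
clip_features = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]

# (i) Max-pooled action feature vector.
pooled = max_pool_over_time(clip_features)

# (ii) Output of a multi-label entity classifier (untrained here).
entities = entity_probabilities(
    pooled, random_matrix(NUM_ENTITIES, D, rng), [0.0] * NUM_ENTITIES)

# Either vector can then seed the decoder in place of an encoder + attention.
h0_pooled = init_decoder_state(
    pooled, random_matrix(HIDDEN, D, rng), [0.0] * HIDDEN)
h0_entity = init_decoder_state(
    entities, random_matrix(HIDDEN, NUM_ENTITIES, rng), [0.0] * HIDDEN)
```

In both variants the decoder sees one fixed-size vector per video instead of attending over the full feature sequence, which is the simplification the abstract contrasts with the sequence-to-sequence baselines.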

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax