Exploration of Visual Features and their weighted-additive fusion for Video Captioning

Praveen S V; Akhilesh Bharadwaj; Harsh Raj; Janhavi Dadhania; Ganesh; Samarth C.A; Nikhil Pareek; S R M Prasanna

arXiv:2101.05806 · cs.CV · January 18, 2021

TL;DR

This paper presents WAFTM, a video captioning model that combines a weighted-additive fusion of visual features with a memory-augmented transformer encoder, and benchmarks it on the MSVD and ActivityNet Captions datasets.

Contribution

Introduction of WAFTM, a transformer-based video captioning model with memory and a new feature fusion technique that emphasizes significant visual representations.

Findings

01 Achieved a CIDEr score of 92.4 on the MSVD dataset.

02 Obtained a METEOR score of 0.091 on the ActivityNet Captions dataset.

03 Demonstrated performance gains from Word-Piece Tokenization and REINFORCE training.
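
The REINFORCE step mentioned in the findings can be illustrated with a minimal sketch of a policy-gradient loss for one sampled caption. This page does not state which reward or baseline the authors use, so the `reward` (for example, a caption metric score), the `baseline`, and the function name `reinforce_loss` are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def reinforce_loss(token_log_probs, reward, baseline=0.0):
    """Surrogate loss whose gradient is the REINFORCE estimate for one caption.

    token_log_probs: (L,) log-probabilities the model assigned to the
                     tokens of a caption it sampled
    reward:          scalar score of that caption (assumed: a caption metric)
    baseline:        scalar subtracted from the reward to reduce variance
                     (assumed; the source does not specify one)
    """
    advantage = reward - baseline
    # Minimizing this raises the probability of captions that beat the baseline.
    return -advantage * float(np.sum(token_log_probs))

# Hypothetical sampled caption of three tokens with these model probabilities.
log_probs = np.log(np.array([0.5, 0.25, 0.8]))
loss = reinforce_loss(log_probs, reward=0.9, baseline=0.6)
```

In a real training loop the log-probabilities would come from an autodiff framework so that the loss can be backpropagated; NumPy is used here only to keep the arithmetic visible.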

Abstract

Video captioning is a popular task that challenges models to describe events in videos using natural language. In this work, we investigate the ability of various visual feature representations derived from state-of-the-art convolutional neural networks to capture high-level semantic context. We introduce the Weighted Additive Fusion Transformer with Memory Augmented Encoders (WAFTM), a captioning model that incorporates memory in a transformer encoder and uses a novel feature-fusion method that gives due importance to the more significant representations. We illustrate the performance gain realized by applying Word-Piece Tokenization and the popular REINFORCE algorithm. Finally, we benchmark our model on two datasets and obtain a CIDEr of 92.4 on MSVD and a METEOR of 0.091 on the ActivityNet Captions dataset.
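
The abstract does not give the fusion formula, so the following is only one plausible reading of "weighted-additive fusion": each feature stream is projected to a shared size and the streams are summed with softmax-normalized importance weights. The projection matrices, the importance scores, the feature sizes, and the function name `weighted_additive_fusion` are hypothetical and chosen for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weighted_additive_fusion(features, projections, scores):
    """Fuse feature streams as a weighted sum of their projections.

    features:    list of (T, d_i) arrays, per-frame features from one network each
    projections: list of (d_i, d) arrays, assumed linear maps to a shared size d
    scores:      (n_streams,) assumed learnable importance scores
    """
    weights = softmax(np.asarray(scores, dtype=float))
    projected = [f @ p for f, p in zip(features, projections)]
    # Streams with larger weights contribute more to the fused representation.
    fused = sum(w * z for w, z in zip(weights, projected))
    return fused, weights

rng = np.random.default_rng(0)
T, d = 8, 16                 # hypothetical frame count and shared feature size
dims = [32, 24]              # hypothetical sizes of two visual feature streams
features = [rng.standard_normal((T, di)) for di in dims]
projections = [0.1 * rng.standard_normal((di, d)) for di in dims]
fused, weights = weighted_additive_fusion(features, projections, scores=[1.0, 0.0])
```

Here `fused` has one row per frame in the shared size `d`, and the first stream receives the larger weight because its score is higher. In a trained model the scores and projections would be learned parameters rather than fixed values.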

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Attention Is All You Need · Byte Pair Encoding · Multi-Head Attention · Dropout · Softmax · Layer Normalization