Exploration of Visual Features and their weighted-additive fusion for Video Captioning
Praveen S V, Akhilesh Bharadwaj, Harsh Raj, Janhavi Dadhania, Ganesh, Samarth C.A, Nikhil Pareek, S R M Prasanna

TL;DR
This paper presents WAFTM, a video captioning model that combines a memory-augmented transformer encoder with weighted-additive fusion of visual features, reporting a CIDEr of 92.4 on MSVD and a METEOR of 0.091 on ActivityNet Captions.
Contribution
Introduction of WAFTM, a transformer-based video captioning model with memory and a new feature fusion technique that emphasizes significant visual representations.
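To make the idea of a weighted-additive fusion concrete, here is a minimal sketch. It is not the authors' implementation: this summary does not give the fusion formula, so the sketch assumes each feature stream has already been projected to a common dimension, and that one scalar importance score per stream (a hypothetical stand-in for learned parameters) is normalized with a softmax before the streams are summed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_additive_fusion(features, scores):
    """Fuse equal-length feature vectors as a weighted sum.

    features: list of feature vectors, one per visual stream (same length).
    scores:   one unnormalized importance score per stream. Assumption:
              in a real model these would be learned; here they are inputs.
    Returns (fused_vector, normalized_weights).
    """
    if len(features) != len(scores):
        raise ValueError("need exactly one score per feature vector")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must share one dimension")
    weights = softmax(scores)
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


# Toy example: two hypothetical 3-d streams (e.g. appearance and motion).
appearance = [1.0, 0.0, 2.0]
motion = [0.0, 4.0, 2.0]
fused, weights = weighted_additive_fusion([appearance, motion], [0.0, 0.0])
```

With equal scores the two streams are simply averaged; raising one score shifts the fused vector toward that stream, which is one way "more significant" representations could receive more weight.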
Findings
Achieved a CIDEr score of 92.4 on the MSVD dataset.
Obtained a METEOR score of 0.091 on the ActivityNet Captions dataset.
Demonstrated performance gains from Word-Piece Tokenization and REINFORCE training.
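For readers unfamiliar with Word-Piece Tokenization, the sketch below shows the usual greedy longest-match-first segmentation of a single word. The toy vocabulary, the `##` continuation prefix, and the `[UNK]` token follow a common convention and are assumptions here; the summary does not describe the tokenizer configuration the authors used.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces by greedy longest match.

    Pieces that do not start the word carry a "##" prefix (assumed
    convention). If any position has no matching piece, the whole
    word maps to the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary for illustration only.
vocab = {"play", "##ing", "##s", "guitar"}
tokens = wordpiece_tokenize("playing", vocab)
unknown = wordpiece_tokenize("xyz", vocab)
```

Here `tokens` is `["play", "##ing"]` and `unknown` is `["[UNK]"]`. Splitting rare words into frequent pieces keeps the caption vocabulary small while still covering unseen word forms.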
Abstract
Video captioning is a popular task that challenges models to describe events in videos using natural language. In this work, we investigate the ability of various visual feature representations derived from state-of-the-art convolutional neural networks to capture high-level semantic context. We introduce the Weighted Additive Fusion Transformer with Memory Augmented Encoders (WAFTM), a captioning model that incorporates memory in a transformer encoder and uses a novel feature fusion method that gives due importance to the more significant representations. We show the performance gain obtained by applying Word-Piece Tokenization and the popular REINFORCE algorithm. Finally, we benchmark our model on two datasets and obtain a CIDEr of 92.4 on MSVD and a METEOR of 0.091 on the ActivityNet Captions dataset.
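The REINFORCE step mentioned in the abstract can be illustrated with a short sketch of the policy-gradient loss for one sampled caption. This is a generic form of the algorithm, not the authors' training code: the reward is assumed to be a sentence-level score such as CIDEr, and the `baseline` argument is a hypothetical variance-reduction term, since the summary states neither the reward nor the baseline actually used.

```python
import math


def reinforce_loss(token_probs, reward, baseline=0.0):
    """REINFORCE loss for a single sampled caption.

    token_probs: probability the model assigned to each sampled token.
    reward:      scalar score of the sampled caption (assumed, e.g. CIDEr).
    baseline:    scalar subtracted from the reward to reduce variance
                 (hypothetical; 0.0 means no baseline).
    Minimizing this loss raises the log-probability of captions whose
    reward exceeds the baseline and lowers it for those below.
    """
    log_prob = sum(math.log(p) for p in token_probs)
    return -(reward - baseline) * log_prob


# Toy example: a two-token caption sampled with probabilities 0.5 and 0.25.
loss = reinforce_loss([0.5, 0.25], reward=1.0, baseline=0.5)
```

In an actual training loop the token probabilities would come from the decoder and the loss would be differentiated by an autodiff framework; the plain-float version above only shows how reward, baseline, and log-probability combine.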
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Attention Is All You Need · Byte Pair Encoding · Multi-Head Attention · Dropout · Softmax · Layer Normalization
