METEOR Guided Divergence for Video Captioning

Daniel Lukas Rothenpieler; Shahin Amiriparian

arXiv:2212.10690 · cs.CV · December 22, 2022

TL;DR

This paper introduces a novel audiovisual video captioning approach using a reward-guided KL divergence and a Bi-Modal Hierarchical Reinforcement Learning Transformer to improve long-term dependency modeling and sentence quality.

Contribution

It proposes a new training method and a hierarchical transformer architecture that better captures temporal dependencies and audiovisual information for video captioning.

Findings

01. Achieved a BLEU3 score of 4.91, a BLEU4 score of 2.23, and a METEOR score of 10.80 on ActivityNet Captions.

02. Demonstrated improved caption quality in terms of content completeness and grammatical correctness.

03. Released the code and models publicly for further research.

Abstract

Automatic video captioning aims for a holistic visual scene understanding. It requires a mechanism for capturing temporal context in video frames and the ability to comprehend the actions and associations of objects in a given timeframe. Such a system should additionally learn to abstract video sequences into sensible representations as well as to generate natural written language. While the majority of captioning models focus solely on the visual inputs, little attention has been paid to the audiovisual modality. To tackle this issue, we propose a novel two-fold approach. First, we implement a reward-guided KL Divergence to train a video captioning model which is resilient towards token permutations. Second, we utilise a Bi-Modal Hierarchical Reinforcement Learning (BMHRL) Transformer architecture to capture long-term temporal dependencies of the input data as a foundation for our…
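
The abstract does not spell out how the reward-guided KL divergence is formulated. As a rough illustration only, the Python sketch below assumes one plausible reading: per-token rewards (for example, METEOR-style scores) are turned into a target distribution with a softmax, and the loss is the KL divergence from that target to the model's predicted token distribution. The function names `softmax` and `reward_guided_kl`, the `temperature` parameter, and the toy numbers are hypothetical and are not taken from the paper or its repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reward_guided_kl(logits, rewards, temperature=1.0):
    """KL(target || predicted) for a single decoding step.

    ASSUMPTION: the target distribution is a softmax over per-token
    rewards. This is an illustrative guess, not the paper's definition.
    """
    target = softmax([r / temperature for r in rewards])
    predicted = softmax(logits)
    return sum(
        t * (math.log(t) - math.log(p))
        for t, p in zip(target, predicted)
        if t > 0.0
    )


# Toy vocabulary of three candidate tokens (made-up numbers).
rewards = [0.1, 0.5, 0.9]        # hypothetical per-token reward scores
good_logits = [0.1, 0.5, 0.9]    # model agrees with the reward ranking
bad_logits = [0.9, 0.5, 0.1]     # model prefers the low-reward token

loss_good = reward_guided_kl(good_logits, rewards)
loss_bad = reward_guided_kl(bad_logits, rewards)
print(loss_good < loss_bad)
```

Under this assumption, the loss is zero when the model's distribution matches the reward-derived target and grows as probability mass shifts toward low-reward tokens, so several differently ordered but well-scoring tokens can all receive credit instead of a single reference token.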

Code & Models

Repositories

d-rothen/bmhrl
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Absolute Position Encodings