Video Description: A Survey of Methods, Datasets and Evaluation Metrics

Nayyer Aafaq; Ajmal Mian; Wei Liu; Syed Zulqarnain Gilani; Mubarak Shah

arXiv:1806.00186 · cs.CV · March 4, 2020 · 95 cites

Open Access

TL;DR

This survey reviews recent deep learning approaches, datasets, and evaluation metrics for automatic video description, highlighting challenges and future directions in this rapidly evolving field.

Contribution

It provides a comprehensive comparison of state-of-the-art methods, datasets, and metrics, and discusses the challenges and limitations in current video description research.

Findings

01 Deep learning models dominate current approaches.

02 Existing datasets lack diversity and linguistic complexity.

03 Each evaluation metric has its own strengths and weaknesses in assessing description quality.

Abstract

Video description is the automatic generation of natural language sentences that describe the contents of a given video. It has applications in human-robot interaction, helping the visually impaired and video subtitling. The past few years have seen a surge of research in this area due to the unprecedented success of deep learning in computer vision and natural language processing. Numerous methods, datasets and evaluation metrics have been proposed in the literature, creating the need for a comprehensive survey to focus research efforts in this flourishing new direction. This paper fills the gap by surveying the state-of-the-art approaches with a focus on deep learning models; comparing benchmark datasets in terms of their domains, number of classes, and repository size; and identifying the pros and cons of various evaluation metrics like SPICE, CIDEr, ROUGE, BLEU, METEOR, and WMD.…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning