Video Captioning: a comparative review of where we are and which could be the route
Daniela Moctezuma, Tania Ramírez-delReal, Guillermo Ruiz, Othón González-Chávez

TL;DR
This paper provides a comprehensive review of video captioning research from 2016 to 2021, analyzing methods, datasets, and performance to identify current trends and future opportunities.
Contribution
It offers an extensive comparative analysis of more than 105 papers, ranks the methods, and suggests promising directions for future research in video captioning.
Findings
Identified the most-used datasets and metrics.
Ranked the top-performing methods based on performance metrics.
Provided insights into future research opportunities.
Abstract
Video captioning is the process of describing the content of a sequence of images, capturing its semantic relationships and meanings. Dealing with this task for a single image is arduous, and it is even more difficult for a video (or image sequence). The applications of video captioning are numerous and relevant, for example handling the large volume of recordings in video surveillance or assisting visually impaired people. To analyze where the efforts of our community to solve the video captioning task stand, as well as which route could be better to follow, this manuscript presents an extensive review of more than 105 papers for the period 2016 to 2021. As a result, the most-used datasets and metrics are identified, along with the main approaches used and the best-performing ones. We compute a set of rankings based on several performance metrics to obtain,…
Taxonomy
Topics: Subtitles and Audiovisual Media · Multimodal Machine Learning Applications · Video Analysis and Summarization
