Evaluation of Automatic Video Captioning Using Direct Assessment
Yvette Graham, George Awad, Alan Smeaton

TL;DR
This paper introduces Direct Assessment, a human-centered evaluation method for automatic video captions that addresses the limitations of automatic metrics by incorporating crowdsourced human judgments.
Contribution
The paper presents a crowdsourcing-based evaluation method for video captioning that brings human judgments into the evaluation, where automatic metrics such as BLEU and METEOR are shown to have weaknesses, and that accounts for assessor quality.
Findings
Direct Assessment is replicable and robust.
It effectively evaluates caption quality in the TRECVid 2016 dataset.
The method scales to multiple caption-generation techniques.
Abstract
We present Direct Assessment, a method for manually assessing the quality of automatically generated captions for video. Evaluating the accuracy of video captions is particularly difficult because, for any given video clip, there is no definitive ground truth or correct answer against which to measure. Automatic metrics for comparing automatic video captions against a manual caption, such as BLEU and METEOR, drawn from techniques used in evaluating machine translation, were used in the TRECVid video captioning task in 2016, but these are shown to have weaknesses. The work presented here brings human assessment into the evaluation by crowdsourcing how well a caption describes a video. We automatically degrade the quality of some sample captions, which are assessed manually, and from this we are able to rate the quality of the human assessors, a factor we take into account in the evaluation.…
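The assessor quality check described in the abstract can be illustrated with a minimal sketch. It assumes each assessor rates some captions both in original and in automatically degraded form on a 0 to 100 scale, and that an assessor counts as reliable when the originals are scored higher than the degraded versions significantly often. The exact one-sided sign test, the 0.05 threshold, and all function and variable names are illustrative assumptions, not the procedure or parameters specified by the paper.

```python
from math import comb
from statistics import mean


def sign_test_p(pairs):
    """One-sided exact sign test on (original, degraded) score pairs.

    Returns the probability of seeing at least this many pairs with
    original > degraded if the assessor were scoring at random.
    Ties are dropped. (Assumed test, chosen for simplicity.)
    """
    wins = sum(1 for orig, deg in pairs if orig > deg)
    losses = sum(1 for orig, deg in pairs if orig < deg)
    n = wins + losses
    if n == 0:
        return 1.0
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n


def reliable_assessors(control_pairs, alpha=0.05):
    """Keep assessors who reliably score degraded captions lower."""
    return {a for a, pairs in control_pairs.items()
            if sign_test_p(pairs) < alpha}


def system_scores(ratings, keep):
    """Mean score per captioning system, using retained assessors only."""
    by_system = {}
    for assessor, system, score in ratings:
        if assessor in keep:
            by_system.setdefault(system, []).append(score)
    return {system: mean(scores) for system, scores in by_system.items()}


# Hypothetical data: w1 separates degraded captions, w2 does not.
control_pairs = {
    "w1": [(80, 30), (70, 20), (90, 40), (60, 10), (75, 35)],
    "w2": [(50, 55), (40, 60), (70, 65), (30, 30), (45, 50)],
}
ratings = [("w1", "sysA", 80), ("w1", "sysB", 40),
           ("w2", "sysA", 10), ("w2", "sysB", 90)]

keep = reliable_assessors(control_pairs)
print(system_scores(ratings, keep))
```

In this toy data only the assessor labelled w1 passes the check, so the system-level means are computed from that assessor's ratings alone. A real setup would need many more control pairs per assessor for the test to have useful power.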
