TL;DR
EMScore is a new reference-free metric for video captioning that measures similarity between videos and captions using embedding matching, outperforming existing metrics in correlation with human judgment and ability to detect hallucinations.
Contribution
Proposes EMScore, a novel embedding matching-based, reference-free metric for video captioning that combines coarse- and fine-grained visual and linguistic similarity measures.
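As a rough illustration of how coarse- and fine-grained embedding matching can be combined, here is a minimal sketch. It is not the authors' implementation: the function name, the uniform weighting of tokens and frames, the greedy max-similarity matching, and the equal-weight average of the two scores are all assumptions made for this example. The embeddings are assumed to come from some pre-trained vision-language model, which is not shown.

```python
import numpy as np


def _normalize(x):
    """L2-normalize along the last axis so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def embedding_matching_score(frame_emb, token_emb, video_emb, caption_emb):
    """Hypothetical helper: coarse + fine-grained video/caption similarity.

    frame_emb:   (num_frames, dim) per-frame visual embeddings
    token_emb:   (num_tokens, dim) per-token text embeddings
    video_emb:   (dim,) global video embedding
    caption_emb: (dim,) global caption embedding
    """
    frames = _normalize(frame_emb)
    tokens = _normalize(token_emb)

    # Coarse-grained: cosine similarity between the two global embeddings.
    coarse = float(_normalize(video_emb) @ _normalize(caption_emb))

    # Fine-grained: greedy matching on the token-by-frame similarity matrix.
    sim = tokens @ frames.T                   # (num_tokens, num_frames)
    precision = float(sim.max(axis=1).mean())  # best frame for each token
    recall = float(sim.max(axis=0).mean())     # best token for each frame
    if precision + recall <= 0:
        fine = 0.0                             # assumed guard against a degenerate F-score
    else:
        fine = 2 * precision * recall / (precision + recall)

    # Assumption: combine the two granularities with a simple average.
    return {"coarse": coarse, "fine": fine, "combined": (coarse + fine) / 2}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(8, 16))   # stand-in for real frame embeddings
    tokens = rng.normal(size=(5, 16))   # stand-in for real token embeddings
    scores = embedding_matching_score(
        frames, tokens, frames.mean(axis=0), tokens.mean(axis=0)
    )
    print(scores)
```

Because the score needs only the video and the candidate caption, no reference caption appears anywhere in the computation, which is what makes a metric of this kind reference-free.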
Findings
EMScore correlates more strongly with human judgments than existing metrics.
EMScore effectively detects hallucinating captions.
The datasets VATEX-EVAL and ActivityNet-FOIL are introduced for evaluation.
Abstract
Current metrics for video captioning are mostly based on text-level comparison between reference and candidate captions. However, they have some insuperable drawbacks, e.g., they cannot handle videos without references, and they may result in biased evaluation due to the one-to-many nature of video-to-text and the neglect of visual relevance. From the human evaluator's viewpoint, a high-quality caption should be consistent with the provided video, but need not be literally or semantically similar to the reference. Inspired by human evaluation, we propose EMScore (Embedding Matching-based score), a novel reference-free metric for video captioning, which directly measures similarity between video and candidate captions. Benefiting from the recent development of large-scale pre-training models, we exploit a well pre-trained vision-language model to extract visual and linguistic…
