VideoSET: Video Summary Evaluation through Text
Serena Yeung, Alireza Fathi, and Li Fei-Fei

TL;DR
VideoSET is a text-based method for evaluating video summaries. It converts a video summary into a text representation and measures its semantic distance to human-written ground-truth text summaries, which agrees with human judgment more closely than pixel-based distance metrics do.
Contribution
The paper proposes a text-based approach to evaluating video summaries that captures how much of the original video's semantic content a summary retains, which pixel-based metrics measure less well.
Findings
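Higher agreement with human judgment than pixel-based distance metrics
Evaluates how well a summary retains the semantic information of its original video
Releases text annotations and ground-truth text summaries for a number of publicly available video datasets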
Abstract
In this paper we present VideoSET, a method for Video Summary Evaluation through Text that can evaluate how well a video summary is able to retain the semantic information contained in its original video. We observe that semantics is most easily expressed in words, and develop a text-based approach for the evaluation. Given a video summary, a text representation of the video summary is first generated, and an NLP-based metric is then used to measure its semantic distance to ground-truth text summaries written by humans. We show that our technique has higher agreement with human judgment than pixel-based distance metrics. We also release text annotations and ground-truth text summaries for a number of publicly available video datasets, for use by the computer vision community.
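The pipeline described in the abstract (build a text representation of the summary, then score it against human-written references) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the NLP metric on this page, so the sketch substitutes a simple unigram-recall score as a stand-in. The function names, the per-segment annotation format, the example sentences, and the choice to take the best score across references are all assumptions made for the example.

```python
from collections import Counter


def unigram_recall(candidate: str, reference: str) -> float:
    """Fraction of reference words (with multiplicity) also found in the candidate.

    A stand-in for the NLP-based metric; the actual metric is not given here.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())


def summary_text(selected_segments, annotations):
    """Build the text representation by joining the annotations of the chosen segments."""
    return " ".join(annotations[i] for i in selected_segments)


def score_summary(selected_segments, annotations, references):
    """Score a video summary against several human-written text summaries.

    Taking the maximum over references is an assumption of this sketch.
    """
    text = summary_text(selected_segments, annotations)
    return max(unigram_recall(text, ref) for ref in references)


# Hypothetical per-segment text annotations and ground-truth text summaries.
annotations = [
    "a man opens the fridge",
    "he pours milk into a glass",
    "he walks to the sofa",
]
references = [
    "a man pours milk",
    "the man gets milk from the fridge",
]

good = score_summary([0, 1], annotations, references)  # keeps the key events
poor = score_summary([2], annotations, references)     # keeps an unrelated segment
print(good, poor)
```

In this toy case the summary that keeps the first two segments scores higher than the one that keeps only the last segment, because its text covers more of what the human summaries mention, regardless of how the frames look at the pixel level.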
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Natural Language Processing Techniques
