VideoSET: Video Summary Evaluation through Text

Serena Yeung, Alireza Fathi, and Li Fei-Fei

arXiv:1406.5824 · cs.CV · June 24, 2014 · 42 cites

Open Access

TL;DR

VideoSET introduces a text-based evaluation method for video summaries that aligns more closely with human judgment by measuring semantic similarity between generated and ground-truth text summaries.

Contribution

The paper proposes a text-based approach to evaluating video summaries that captures semantic content better than pixel-based distance metrics.

Findings

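01. Higher agreement with human judgment than pixel-based distance metrics

02. Evaluates how well a video summary retains the semantic information of its original video

03. Releases text annotations and ground-truth text summaries for several publicly available video datasets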

Abstract

In this paper we present VideoSET, a method for Video Summary Evaluation through Text that can evaluate how well a video summary is able to retain the semantic information contained in its original video. We observe that semantics is most easily expressed in words, and develop a text-based approach for the evaluation. Given a video summary, a text representation of the video summary is first generated, and an NLP-based metric is then used to measure its semantic distance to ground-truth text summaries written by humans. We show that our technique has higher agreement with human judgment than pixel-based distance metrics. We also release text annotations and ground-truth text summaries for a number of publicly available video datasets, for use by the computer vision community.
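The two-step evaluation described in the abstract (build a text representation of the summary, then score it against human-written reference summaries) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the per-segment `annotations` list, the helper names, and the unigram-overlap F-score, which is only a simple stand-in because the abstract does not say which NLP-based metric is used.

```python
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (illustrative assumption).
    return text.lower().split()


def unigram_f1(candidate, reference):
    # Unigram-overlap F-score: a stand-in similarity, not the paper's metric.
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def summary_to_text(selected_segments, annotations):
    # Step 1: text representation of a video summary, assumed here to be
    # the concatenated text annotations of the segments it keeps.
    return " ".join(annotations[i] for i in selected_segments)


def score_summary(selected_segments, annotations, references):
    # Step 2: compare against each human-written reference summary and
    # keep the best match (taking the max is an assumption of this sketch).
    text = summary_to_text(selected_segments, annotations)
    return max(unigram_f1(text, ref) for ref in references)


# Hypothetical toy data: one text annotation per video segment.
annotations = ["a man opens the fridge", "he pours milk", "he watches tv"]
references = ["a man pours milk"]

relevant = score_summary([1], annotations, references)
irrelevant = score_summary([2], annotations, references)
print(relevant > irrelevant)
```

In this toy setup, a summary that keeps the segment matching the reference text scores higher than one that keeps an unrelated segment, which is the behavior a semantic evaluation is meant to reward.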

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Natural Language Processing Techniques