VideoXum: Cross-modal Visual and Textural Summarization of Videos

Jingyang Lin; Hang Hua; Ming Chen; Yikang Li; Jenhao Hsiao; Chiuman Ho; Jiebo Luo

arXiv:2303.12060 · cs.CV · April 24, 2024 · 1 citation

Open Access · 1 Repo · 2 Datasets

TL;DR

This paper introduces VideoXum, a new dataset and model for joint video and text summarization, aiming to produce semantically aligned visual and textual summaries from long videos.

Contribution

The paper presents VideoXum, the first large-scale human-annotated dataset for cross-modal summarization, and proposes VTSUM-BLIP, a novel end-to-end model for the task.

Findings

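01 The proposed model achieves promising results on the new cross-modal summarization task.

02 The paper introduces VT-CLIPScore, a metric for evaluating the semantic consistency between the video summary and the text summary.

03 The dataset and results establish a benchmark for future cross-modal summarization research.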
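
This page does not define VT-CLIPScore, so the following is only a minimal sketch of how a CLIP-style consistency score between a video summary and a text summary could be computed. It assumes the frame and text embeddings have already been produced by some vision-language encoder. The function names, the mean-over-frames aggregation, and the clipping at zero are illustrative assumptions, not the paper's definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as carrying no similarity signal
    return dot / (norm_u * norm_v)


def clip_style_consistency(frame_embeddings, text_embedding):
    """Hypothetical helper: mean frame-to-text cosine similarity.

    Negative similarities are clipped to zero, as CLIPScore-style
    metrics commonly do. This is an assumption, not the paper's formula.
    """
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    sims = [max(cosine(f, text_embedding), 0.0) for f in frame_embeddings]
    return sum(sims) / len(sims)


# Toy embeddings: one frame aligned with the text, one orthogonal to it.
frames = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
score = clip_style_consistency(frames, text)
```

In this toy case the aligned frame contributes a similarity of 1 and the orthogonal frame contributes 0, so the mean is 0.5. A real evaluation would replace the toy vectors with encoder outputs for the selected frames and the generated summary text.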

Abstract

Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated shortened video clip and text narratives should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After we filter out the videos that do not…

Code & Models

Repositories

jylins/videoxum (PyTorch, official)

Datasets

Taxonomy

Topics: Video Analysis and Summarization · Music and Audio Processing · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training