AudioVisual Video Summarization

Bin Zhao; Maoguo Gong; Xuelong Li

arXiv:2105.07667 · cs.CV · May 18, 2021 · 1 citation

TL;DR

This paper introduces a novel audiovisual recurrent network that jointly leverages audio and visual data to improve video summarization, demonstrating superior performance over visual-only methods on benchmark datasets.

Contribution

The paper proposes the AudioVisual Recurrent Network (AVRN), a three-part architecture that fuses audio and visual features to improve video understanding for summarization.

Findings

01 AVRN outperforms visual-only methods on the SumMe and TVSum datasets.

02 Each component of AVRN contributes significantly to overall performance.

03 Multimodal fusion improves the quality of video summaries.

Abstract

Audio and vision are the two main modalities in video data. Multimodal learning, especially audiovisual learning, has recently drawn considerable attention and can boost the performance of various computer vision tasks. However, existing video summarization approaches exploit only the visual information and neglect the audio information. In this paper, we argue that the audio modality can help the visual modality better understand video content and structure, and thereby benefit the summarization process. Motivated by this, we propose to jointly exploit audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this. Specifically, the proposed AVRN can be separated into three parts: 1) a two-stream LSTM is utilized to encode the audio and visual features sequentially by capturing their temporal…
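
The first part named in the abstract, a two-stream LSTM that encodes the audio and visual feature sequences separately, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses the standard LSTM gate equations (sigmoid gates, tanh candidate), and the function names, feature dimensions, hidden size, and random weights are all hypothetical choices made for illustration. The remaining two parts of AVRN are cut off in the abstract above and are not sketched here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(input_dim, hidden_dim, rng):
    """Small random weights for the input (i), forget (f), output (o)
    and candidate (g) gates. Purely illustrative, untrained values."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {gate: (matrix(hidden_dim, input_dim + hidden_dim), [0.0] * hidden_dim)
            for gate in "ifog"}


def affine(weights, bias, vec):
    """Compute weights @ vec + bias on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def lstm_encode(seq, params):
    """Run one LSTM over a sequence of feature vectors; return every hidden state."""
    hidden_dim = len(params["i"][1])
    h = [0.0] * hidden_dim  # hidden state
    c = [0.0] * hidden_dim  # cell state
    states = []
    for x in seq:
        z = list(x) + h  # concatenate current input with previous hidden state
        i = [sigmoid(a) for a in affine(*params["i"], z)]
        f = [sigmoid(a) for a in affine(*params["f"], z)]
        o = [sigmoid(a) for a in affine(*params["o"], z)]
        g = [math.tanh(a) for a in affine(*params["g"], z)]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        states.append(h)
    return states


def two_stream_encode(visual_seq, audio_seq, visual_params, audio_params):
    """Encode each modality with its own LSTM (assumed: one feature vector
    per time step in each stream, with the two streams aligned in time)."""
    if len(visual_seq) != len(audio_seq):
        raise ValueError("visual and audio sequences must have the same length")
    return lstm_encode(visual_seq, visual_params), lstm_encode(audio_seq, audio_params)


# Hypothetical sizes: 5 time steps, 8-d visual and 4-d audio features, 3-d hidden state.
rng = random.Random(0)
visual_features = [[rng.random() for _ in range(8)] for _ in range(5)]
audio_features = [[rng.random() for _ in range(4)] for _ in range(5)]
visual_states, audio_states = two_stream_encode(
    visual_features, audio_features,
    init_lstm(8, 3, rng), init_lstm(4, 3, rng),
)
```

Keeping separate parameters per stream lets each modality learn its own temporal dynamics before any fusion step; how the paper combines the two sets of hidden states is not recoverable from the truncated abstract.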

Taxonomy

Topics: Video Analysis and Summarization · Music and Audio Processing · Digital Media Forensic Detection

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory