Video Joint Modelling Based on Hierarchical Transformer for Co-summarization

Li Haopeng; Ke Qiuhong; Gong Mingming; Zhang Rui

arXiv:2112.13478·cs.CV·June 30, 2022

TL;DR

This paper introduces VJMHT, a hierarchical transformer model that jointly models multiple videos to improve summarization by capturing cross-video semantic dependencies, leading to more informative summaries.

Contribution

The paper proposes a novel hierarchical transformer framework for co-summarization that explicitly models semantic dependencies across similar videos.

Findings

01 VJMHT outperforms existing methods in F-measure and ranking evaluations.

02 Cross-video semantic modeling improves summarization quality.

03 Transformer-based representation reconstruction enhances summary fidelity.
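
The F-measure mentioned in the first finding is not defined on this page. As a rough illustration only, the sketch below assumes the frame-overlap definition commonly used in the video summarization literature: precision and recall are computed from the frames shared by a predicted summary and a reference summary, and the F-measure is their harmonic mean. The function name `f_measure` and the frame-index inputs are hypothetical and are not taken from the paper.

```python
def f_measure(predicted_frames, reference_frames):
    """Harmonic mean of precision and recall over selected frame indices.

    Assumption: both summaries are given as collections of frame indices.
    This is a common convention, not necessarily the paper's exact protocol.
    """
    predicted = set(predicted_frames)
    reference = set(reference_frames)
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two 20-frame summaries that share 15 frames.
    predicted = list(range(0, 10)) + list(range(20, 30))
    reference = list(range(5, 15)) + list(range(20, 30))
    print(f_measure(predicted, reference))  # 0.75
```

A summary that matches the reference exactly scores 1.0, and one with no shared frames scores 0.0.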

Abstract

Video summarization aims to automatically generate a summary (storyboard or video skim) of a video, which can facilitate large-scale video retrieval and browsing. Most of the existing methods perform video summarization on individual videos, which neglects the correlations among similar videos. Such correlations, however, are also informative for video understanding and video summarization. To address this limitation, we propose Video Joint Modelling based on Hierarchical Transformer (VJMHT) for co-summarization, which takes into consideration the semantic dependencies across videos. Specifically, VJMHT consists of two layers of Transformer: the first layer extracts semantic representation from individual shots of similar videos, while the second layer performs shot-level video joint modelling to aggregate cross-video semantic information. By this means, complete cross-video high-level…
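
The two-level structure described in the abstract can be sketched roughly as follows. This is a minimal illustration under strong assumptions, not the authors' implementation: each level is reduced to single-head scaled dot-product self-attention with identity projections (no learned weights, feed-forward layers, or positional encodings), and mean pooling is assumed for turning a shot's frame features into one shot vector. All function names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), so there is nothing to train here.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, tokens))
             for j in range(dim)]
        )
    return outputs


def encode_shot(frames):
    """First level: attend over the frames of one shot, then mean-pool
    (the pooling step is an assumption) into a single shot vector."""
    attended = self_attention(frames)
    dim = len(frames[0])
    return [sum(f[j] for f in attended) / len(attended) for j in range(dim)]


def joint_model(videos):
    """Second level: attend over the shot vectors of all videos at once,
    so each shot representation can draw on shots from other videos."""
    shot_vectors = [encode_shot(shot) for video in videos for shot in video]
    return self_attention(shot_vectors)


if __name__ == "__main__":
    random.seed(0)

    def random_shot(num_frames, dim=4):
        return [[random.random() for _ in range(dim)] for _ in range(num_frames)]

    # Two similar videos with 2 and 3 shots; each shot holds a few frame features.
    videos = [
        [random_shot(3), random_shot(4)],
        [random_shot(5), random_shot(2), random_shot(3)],
    ]
    shots = joint_model(videos)
    print(len(shots), len(shots[0]))
```

The point of the sketch is the data flow: frames are summarized per shot first, and only the resulting shot vectors from all similar videos are mixed in the second level.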

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Label Smoothing · Byte Pair Encoding · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Adam