Reconstructive Sequence-Graph Network for Video Summarization
Bin Zhao, Haopeng Li, Xiaoqiang Lu, Xuelong Li

TL;DR
This paper introduces a hierarchical model combining sequence and graph neural networks to improve video summarization by capturing both local and global dependencies, and employs an unsupervised reconstruction approach.
Contribution
The proposed Reconstructive Sequence-Graph Network (RSGN) effectively models multi-hop shot dependencies and uses reconstruction loss for unsupervised training, enhancing summary quality.
Findings
Outperforms existing methods on the SumMe, TVSum, and VTW datasets.
Effectively captures both local and global shot dependencies.
Unsupervised training avoids reliance on annotated data.
Abstract
Exploiting the inner-shot and inter-shot dependencies is essential for key-shot based video summarization. Current approaches mainly focus on modeling the video as a frame sequence with recurrent neural networks. However, one potential limitation of sequence models is that they focus on capturing local neighborhood dependencies, while high-order dependencies over long distances are not fully exploited. In general, the frames in each shot record a certain activity and vary smoothly over time, but multi-hop relationships occur frequently among shots. In this case, both the local and global dependencies are important for understanding the video content. Motivated by this point, we propose a Reconstructive Sequence-Graph Network (RSGN) to encode the frames and shots as sequence and graph hierarchically, where the frame-level dependencies are encoded by Long Short-Term Memory (LSTM),…
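To make the idea of multi-hop shot dependencies concrete, here is a minimal sketch in plain Python. It is not the authors' RSGN: mean pooling stands in for the LSTM frame encoder, a thresholded cosine-similarity graph stands in for the shot graph, and the function names (`mean_pool`, `shot_graph`, `propagate`) and the `threshold` value are hypothetical choices made for illustration only.

```python
import math

def mean_pool(frames):
    """Average the frame vectors of one shot (assumed stand-in for an LSTM encoder)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def shot_graph(shots, threshold=0.5):
    """Row-normalized adjacency with self-loops; the threshold is an assumption."""
    n = len(shots)
    adj = [[1.0 if i == j or cosine(shots[i], shots[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    return [[a / sum(row) for a in row] for row in adj]

def propagate(adj, feats):
    """One message-passing step: each shot averages its neighbors' features."""
    n, dim = len(feats), len(feats[0])
    return [[sum(adj[i][j] * feats[j][d] for j in range(n)) for d in range(dim)]
            for i in range(n)]

# Toy video: three shots, two frames each, 2-D frame features.
video = [
    [[1.0, 0.0], [1.0, 0.0]],  # shot A
    [[1.0, 1.0], [1.0, 1.0]],  # shot B, similar to both A and C
    [[0.0, 1.0], [0.0, 1.0]],  # shot C, dissimilar to A
]
shots = [mean_pool(frames) for frames in video]
adj = shot_graph(shots)            # A and C are not directly connected
one_hop = propagate(adj, shots)    # A sees only itself and B
two_hop_adj = propagate(adj, adj)  # adj @ adj: A now reaches C through B
```

In this toy graph, shots A and C share no edge, yet stacking two propagation steps lets information from C reach A through B. That is the kind of long-range relation a purely sequential model has to carry through every intermediate step.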
