Local Compressed Video Stream Learning for Generic Event Boundary Detection
Libo Zhang, Xin Gu, Congcong Li, Tiejian Luo, Heng Fan

TL;DR
This paper introduces a novel end-to-end compressed video representation learning method for generic event boundary detection that leverages compressed domain information, reducing computational demands while improving detection accuracy.
Contribution
The proposed method operates directly on compressed video data and applies local temporal modeling, using LSTM and attention modules, for efficient boundary detection.
Findings
Achieves significant accuracy improvements over previous methods.
Operates efficiently without full video decoding.
Demonstrates robustness on Kinetics-GEBD and TAPOS datasets.
Abstract
Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be decoded before they are fed into the network; the decoded frames contain significant spatio-temporal redundancy, and decoding demands considerable computational power and storage space. To remedy these issues, we propose a novel compressed video representation learning method for event boundary detection that is fully end-to-end and leverages the rich information in the compressed domain, i.e., RGB, motion vectors, residuals, and the internal group of pictures (GOP) structure, without fully decoding the video. Specifically, we use lightweight ConvNets to extract features of the P-frames in the GOPs, and a spatial-channel attention module (SCAM) is designed to refine the feature representations of the P-frames based on the compressed information…
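The GOP structure mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simplified stream containing only I-frames and P-frames, with each GOP starting at an I-frame; B-frames are ignored. The function names (split_into_gops, plan_decoding) and the frame-type string are hypothetical and are not taken from the paper's code. The sketch shows only how frame indices group into GOPs, so that the I-frame of each GOP can be routed to an RGB branch and its P-frames to a lighter branch that works on motion vectors and residuals.

```python
from typing import Dict, List


def split_into_gops(frame_types: str) -> List[List[int]]:
    """Group frame indices into GOPs; every 'I' starts a new GOP.

    Assumption: the stream contains only 'I' and 'P' frames.
    """
    gops: List[List[int]] = []
    for idx, ftype in enumerate(frame_types):
        if ftype == "I":
            gops.append([idx])          # an I-frame opens a new GOP
        elif ftype == "P":
            if not gops:
                raise ValueError("stream must start with an I-frame")
            gops[-1].append(idx)        # a P-frame joins the current GOP
        else:
            raise ValueError(f"unsupported frame type: {ftype!r}")
    return gops


def plan_decoding(frame_types: str) -> List[Dict[str, object]]:
    """For each GOP, separate the I-frame index from its P-frame indices."""
    return [
        {"i_frame": gop[0], "p_frames": gop[1:]}
        for gop in split_into_gops(frame_types)
    ]


# Hypothetical 9-frame stream with three GOPs.
plan = plan_decoding("IPPPIPPIP")
for entry in plan:
    print(entry)
```

In this toy layout only the three I-frames would need full RGB processing; the remaining six frames are described relative to earlier frames, which is the redundancy the abstract says compressed-domain methods exploit.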
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning · Advanced Vision and Imaging
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
