End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection
Congcong Li, Xinyao Wang, Longyin Wen, Dexiang Hong, Tiejian Luo, Libo Zhang

TL;DR
This paper introduces an end-to-end compressed video representation learning approach for generic event boundary detection that operates directly in the compressed domain, significantly reducing computational costs while maintaining competitive accuracy.
Contribution
The paper proposes a method that detects event boundaries directly from compressed video data (I-frame RGB, motion vectors, and residuals) without fully decoding the video, which reduces computation and storage costs.
Findings
Achieves comparable accuracy to state-of-the-art methods.
Runs 4.5 times faster than existing approaches.
Effectively utilizes compressed domain information for boundary detection.
Abstract
Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be decoded before feeding into the network, which demands considerable computational power and storage space. To that end, we propose a new end-to-end compressed video representation learning for event boundary detection that leverages the rich information in the compressed domain, i.e., RGB, motion vectors, residuals, and the internal group of pictures (GOP) structure, without fully decoding the video. Specifically, we first use the ConvNets to extract features of the I-frames in the GOPs. After that, a light-weight spatial-channel compressed encoder is designed to compute the feature representations of the P-frames based on the motion vectors, residuals and representations of their dependent I-frames. A…
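The per-GOP flow described in the abstract (I-frame features first, then P-frame features derived from motion vectors, residuals, and the I-frame representation) can be illustrated with a toy sketch. This is not the paper's learned spatial-channel compressed encoder: the fixed "warp by motion vector, then add residual" rule, the `PFrame` type, and the function names `warp`, `p_frame_features`, and `gop_features` are all assumptions made for illustration, and the tiny 2x2 grids stand in for real feature maps.

```python
from dataclasses import dataclass
from typing import List, Tuple

Grid = List[List[float]]  # tiny H x W stand-in for a feature map


@dataclass
class PFrame:
    """Hypothetical container for the compressed-domain signals of one P-frame."""
    motion: List[List[Tuple[int, int]]]  # per-cell (dy, dx) offset into the I-frame
    residual: Grid                       # per-cell correction term


def warp(reference: Grid, motion: List[List[Tuple[int, int]]]) -> Grid:
    """Fetch, for every cell, the reference value its motion vector points to."""
    h, w = len(reference), len(reference[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dy, dx = motion[y][x]
            # Clamp to the grid so out-of-range vectors read the nearest border cell.
            sy = min(max(y + dy, 0), h - 1)
            sx = min(max(x + dx, 0), w - 1)
            row.append(reference[sy][sx])
        out.append(row)
    return out


def p_frame_features(i_feat: Grid, p: PFrame) -> Grid:
    """Assumed stand-in for the learned encoder: warped I-frame features + residual."""
    warped = warp(i_feat, p.motion)
    return [[wv + rv for wv, rv in zip(w_row, r_row)]
            for w_row, r_row in zip(warped, p.residual)]


def gop_features(i_feat: Grid, p_frames: List[PFrame]) -> List[Grid]:
    """Features for one GOP: the I-frame, then each P-frame derived from it."""
    return [i_feat] + [p_frame_features(i_feat, p) for p in p_frames]


# Usage: one I-frame feature map and one P-frame in the same GOP.
i_feat = [[1.0, 2.0], [3.0, 4.0]]
p = PFrame(
    motion=[[(0, 1), (0, 0)], [(0, 0), (0, 0)]],  # top-left cell reads its right neighbour
    residual=[[0.0, 0.0], [0.0, 0.5]],
)
feats = gop_features(i_feat, [p])
```

The point of the sketch is only the data dependency: the expensive feature extractor runs once per GOP on the I-frame, and each P-frame reuses that result through cheap compressed-domain signals instead of being decoded and re-encoded from scratch.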
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning · Advanced Vision and Imaging
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
