Compressed Vision for Efficient Video Understanding
Olivia Wiles, Joao Carreira, Iain Barr, Andrew Zisserman, and Mateusz Malinowski

TL;DR
This paper introduces a neural compression framework for efficient processing of hour-long videos. It enables training and analysis on longer videos with standard hardware and addresses the difficulty of applying augmentations in the compressed domain.
Contribution
The authors propose a novel neural compression approach that allows direct input of compressed videos into standard networks, significantly improving efficiency for long video processing.
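To make the efficiency argument concrete, the following is a minimal sketch of the storage arithmetic behind feeding compact codes, rather than raw pixels, to a downstream network. It is not the authors' codec: `toy_encode` is a hypothetical stand-in that uses block average pooling in place of a learned neural encoder, and the clip size and pooling factors are assumptions chosen only for illustration.

```python
import numpy as np


def toy_encode(video: np.ndarray, spatial: int = 8, temporal: int = 2) -> np.ndarray:
    """Hypothetical stand-in for a learned encoder (NOT the paper's model).

    Average-pools a (T, H, W, C) uint8 video over blocks of
    temporal x spatial x spatial pixels and quantizes the result to uint8 codes.
    """
    t, h, w, c = video.shape
    assert t % temporal == 0 and h % spatial == 0 and w % spatial == 0
    blocks = video.reshape(
        t // temporal, temporal,
        h // spatial, spatial,
        w // spatial, spatial,
        c,
    )
    # Pool over the three block axes, keeping the channel axis.
    pooled = blocks.astype(np.float32).mean(axis=(1, 3, 5))
    return np.round(pooled).astype(np.uint8)


def compression_ratio(video: np.ndarray, codes: np.ndarray) -> float:
    """Ratio of raw bytes to code bytes."""
    return video.nbytes / codes.nbytes


# Assumed clip size: 32 frames of 224x224 RGB, filled with random pixels.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(32, 224, 224, 3), dtype=np.uint8)

codes = toy_encode(video)
ratio = compression_ratio(video, codes)
print(codes.shape, ratio)
```

In this toy setting the codes occupy a fixed fraction of the raw bytes (one code per 2 x 8 x 8 block), so a network that consumes `codes` directly can fit a proportionally longer clip in the same memory budget. A learned codec would trade off reconstruction quality and downstream accuracy differently from simple pooling.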
Findings
Efficient training on Kinetics600 and COIN benchmarks.
Successful processing of hour-long videos using compressed representations.
Introduction of a latent code augmentation network.
Abstract
Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos requires more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard…
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · CCD and CMOS Imaging Sensors
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
