Object Detection in Video with Spatiotemporal Sampling Networks
Gedas Bertasius, Lorenzo Torresani, and Jianbo Shi

TL;DR
The paper introduces a Spatiotemporal Sampling Network (STSN) that enhances video object detection by learning to sample features across frames, improving robustness to occlusion and motion blur without extra supervision.
Contribution
The paper applies deformable convolutions across time to video object detection, without relying on optical flow or additional supervision.
Findings
Outperforms the state of the art on the ImageNet VID dataset
Uses a simpler design than prior video object detection methods
Does not require optical flow data for training
Abstract
We propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos. Our STSN performs object detection in a video frame by learning to spatially sample features from the adjacent frames. This naturally renders the approach robust to occlusion or motion blur in individual frames. Our framework does not require additional supervision, as it optimizes sampling locations directly with respect to object detection performance. Our STSN outperforms the state-of-the-art on the ImageNet VID dataset and compared to prior video object detection methods it uses a simpler design, and does not require optical flow data for training.
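The core idea in the abstract, sampling features from an adjacent frame at shifted locations so that they line up with the current frame, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`bilinear_sample`, `sample_with_offsets`, `aggregate`) are hypothetical, the feature maps are single-channel toy grids, the offsets are fixed by hand where STSN would learn them from the detection loss, and uniform averaging stands in for whatever aggregation the paper uses.

```python
import math

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at a fractional (y, x).
    Locations outside the map contribute zero."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1.0 - abs(y - yi)) * (1.0 - abs(x - xi))
                value += weight * feat[yi][xi]
    return value

def sample_with_offsets(feat, offsets):
    """Resample feat so that output[i][j] = feat at (i + dy, j + dx).
    offsets[i][j] is a (dy, dx) pair. Assumption: given by hand here;
    in a learned model a network would predict them."""
    h, w = len(feat), len(feat[0])
    return [
        [bilinear_sample(feat, i + offsets[i][j][0], j + offsets[i][j][1])
         for j in range(w)]
        for i in range(h)
    ]

def aggregate(maps):
    """Average several feature maps element-wise (placeholder aggregation)."""
    n, h, w = len(maps), len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) / n for j in range(w)] for i in range(h)]

# Toy data: an "object" response at (1, 1) in the reference frame has moved
# one pixel to the right, to (1, 2), in the supporting frame.
reference = [[0.0] * 4 for _ in range(4)]
reference[1][1] = 1.0
support = [[0.0] * 4 for _ in range(4)]
support[1][2] = 1.0

# Hand-set offsets (dy, dx) = (0, +1) undo the motion at every location.
offsets = [[(0.0, 1.0)] * 4 for _ in range(4)]
aligned = sample_with_offsets(support, offsets)

naive = aggregate([reference, support])   # object response is smeared
fused = aggregate([reference, aligned])   # object response is preserved
```

With these toy inputs, averaging the unaligned frames halves the response at the object's location, while averaging after offset sampling keeps it intact. That is the intuition behind using neighboring frames to compensate for a blurred or occluded frame.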
Taxonomy
Topics: Advanced Neural Network Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
