Object Detection in Video with Spatiotemporal Sampling Networks

Gedas Bertasius, Lorenzo Torresani, and Jianbo Shi

arXiv:1803.05549 · cs.CV · July 25, 2018 · 30 cites

TL;DR

The paper introduces a Spatiotemporal Sampling Network (STSN) that enhances video object detection by learning to sample features across frames, improving robustness to occlusion and motion blur without extra supervision.

Contribution

It presents a novel deformable convolution approach across time for video object detection that does not rely on optical flow or additional supervision.

Findings

01 Outperforms the state of the art on the ImageNet VID dataset
02 Uses a simpler design than prior video object detection methods
03 Does not require optical flow data for training

Abstract

We propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos. Our STSN performs object detection in a video frame by learning to spatially sample features from the adjacent frames. This naturally renders the approach robust to occlusion or motion blur in individual frames. Our framework does not require additional supervision, as it optimizes sampling locations directly with respect to object detection performance. Our STSN outperforms the state of the art on the ImageNet VID dataset. Compared to prior video object detection methods, it uses a simpler design and does not require optical flow data for training.
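The core idea of sampling features from an adjacent frame at offset locations can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single-channel feature map, one offset per location (roughly a 1x1 deformable kernel), hand-supplied offsets instead of offsets predicted by learned layers, and a plain average for combining frames. The function names `bilinear_sample`, `sample_supporting_frame`, and `aggregate` are hypothetical.

```python
import math


def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at fractional (y, x).

    Neighbours that fall outside the map contribute zero.
    """
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    val = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                val += weight * feat[yy][xx]
    return val


def sample_supporting_frame(feat, offsets):
    """Resample a supporting-frame feature map at (i + dy, j + dx) per location.

    offsets[i][j] is a (dy, dx) pair. In a trained network these would be
    predicted; here they are supplied by hand (an assumption of this sketch).
    """
    h, w = len(feat), len(feat[0])
    return [
        [
            bilinear_sample(feat, i + offsets[i][j][0], j + offsets[i][j][1])
            for j in range(w)
        ]
        for i in range(h)
    ]


def aggregate(feature_maps):
    """Average aligned feature maps element-wise (a simplification)."""
    n = len(feature_maps)
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(m[i][j] for m in feature_maps) / n for j in range(w)]
        for i in range(h)
    ]


# Toy example: the "object" response sits at (1, 1) in the reference frame
# and has moved one pixel right, to (1, 2), in the supporting frame.
reference = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
supporting = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]

# An offset of (0, +1) everywhere makes each location read one pixel to the right.
offsets = [[(0.0, 1.0) for _ in range(3)] for _ in range(3)]

aligned = sample_supporting_frame(supporting, offsets)
fused = aggregate([reference, aligned])
```

In this toy case the resampled supporting frame lines up with the reference frame, so the fused response at the object location stays strong even though the object moved between frames. Because bilinear interpolation is differentiable in the offsets, a network can in principle learn them from the detection loss alone, which is consistent with the abstract's point that no extra supervision is required.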

Taxonomy

Topics: Advanced Neural Network Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning