TL;DR
This paper introduces a unified ConvNet architecture that jointly performs object detection and tracking in videos, simplifying the process and achieving state-of-the-art results on large-scale datasets.
Contribution
The paper presents a novel ConvNet model for simultaneous detection and tracking, incorporating correlation features and linking detections across frames for improved accuracy.
Findings
Achieves state-of-the-art results on ImageNet VID dataset.
Outperforms previous methods with a simpler model.
Increasing temporal stride boosts tracking speed significantly.
Abstract
Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce correlation features that represent object co-occurrences across time to aid the ConvNet during tracking; and (iii) we link the frame level detections based on our across-frame tracklets to produce high accuracy detections at the video level. Our ConvNet architecture for spatiotemporal object detection is evaluated on the large-scale ImageNet VID dataset where…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Detect to Track and Track to Detect· youtube
