TL;DR
TDIOT introduces a novel inference architecture that combines detection and tracking using a pre-trained Mask R-CNN, incorporating appearance similarity, local search, scale adaptation, and verification for improved video object tracking.
Contribution
The paper presents a new inference architecture that enhances deep video object tracking by integrating detection with tracking components without additional training.
Findings
Outperforms state-of-the-art short-term trackers in accuracy.
Provides comparable long-term tracking performance.
Handles scale changes and tracking discontinuities effectively.
Abstract
Recent tracking-by-detection approaches use deep object detectors as the target detection baseline because of their high performance on still images. For effective video object tracking, object detection is integrated with a data association step performed by either a custom-designed inference architecture or end-to-end joint training for tracking purposes. In this work, we adopt the former approach and use the pre-trained Mask R-CNN deep object detector as the baseline. We introduce a novel inference architecture placed on top of the FPN-ResNet101 backbone of Mask R-CNN to jointly perform detection and tracking, without requiring additional training for tracking purposes. The proposed single object tracker, TDIOT, applies an appearance similarity-based temporal matching for data association. In order to tackle tracking discontinuities, we incorporate a local search and matching module into the…
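The appearance similarity-based temporal matching mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not say which similarity measure or threshold TDIOT uses, so cosine similarity, the `threshold` value, and the function names `cosine_similarity` and `match_target` are all assumptions made for illustration. The feature vectors stand in for per-detection appearance features that would come from the detector backbone.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D feature vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def match_target(target_feature, candidate_features, threshold=0.5):
    """Pick the candidate detection most similar to the tracked target.

    Hypothetical helper: returns (index, score) of the best candidate, or
    (None, score) when no candidate reaches the assumed threshold. A None
    index is the point where a tracker could fall back to a local search.
    """
    if len(candidate_features) == 0:
        return None, 0.0
    scores = [cosine_similarity(target_feature, f) for f in candidate_features]
    best = int(np.argmax(scores))
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]


# Toy example: three candidate detections in the current frame.
target = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
index, score = match_target(target, candidates)
```

In this toy example the second candidate (index 1) is selected because its feature vector points in nearly the same direction as the target's. When every candidate falls below the threshold, the function returns `None`, which marks the kind of tracking discontinuity the abstract says the local search and matching module is meant to address.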
Taxonomy
Methods: Region Proposal Network · Softmax · RoIAlign · Convolution · Mask R-CNN
