Tracking by Associating Clips
Sanghyun Woo, Kwanyong Park, Seoung Wug Oh, In So Kweon, Joon-Young Lee

TL;DR
This paper proposes a clip-wise matching approach for multi-object tracking, improving robustness to interruptions and enhancing long-range association by leveraging short video clips instead of frame-by-frame matching.
Contribution
It introduces a novel clip-wise matching framework that mitigates error propagation and utilizes multi-frame information for better long-term tracking.
Findings
Improved tracking accuracy on TAO and MOT17 benchmarks.
Enhanced robustness to occlusions and abrupt scene changes.
Better long-range association compared to traditional frame-wise methods.
Abstract
The tracking-by-detection paradigm has become the dominant method for multi-object tracking. It works by detecting objects in each frame and then performing data association across frames. However, its sequential frame-wise matching fundamentally suffers from intermediate interruptions in a video, such as object occlusions, fast camera movements, and abrupt light changes. Moreover, it typically overlooks temporal information beyond the two frames being matched. In this paper, we investigate an alternative by treating object association as clip-wise matching. Our new perspective views a single long video sequence as multiple short clips, and tracking is then performed both within and between the clips. The benefits of this new approach are twofold. First, our method is robust to tracking error accumulation or propagation, as the video chunking allows bypassing the…
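The clip-wise idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (split_into_clips, associate_clips), the averaged per-tracklet embedding, and the greedy cosine-similarity matching with a fixed threshold are all assumptions made for illustration, standing in for whatever association model the paper actually uses. The sketch only shows the two ingredients named in the abstract: chunking a video into short clips, and matching tracklets between clips using information pooled over several frames.

```python
from math import sqrt


def split_into_clips(num_frames, clip_len):
    """Return (start, end) frame ranges, end exclusive, that cover the video."""
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [(s, min(s + clip_len, num_frames))
            for s in range(0, num_frames, clip_len)]


def mean_embedding(embeddings):
    """Average a tracklet's per-frame embeddings into one clip-level descriptor."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def associate_clips(prev_tracklets, next_tracklets, threshold=0.5):
    """Greedily match tracklets of two clips by descriptor similarity.

    Both arguments map a tracklet id to its list of per-frame embeddings.
    Returns {next_id: prev_id}; unmatched tracklets are omitted.
    The threshold is a hypothetical parameter, not a value from the paper.
    """
    prev_desc = {k: mean_embedding(v) for k, v in prev_tracklets.items()}
    next_desc = {k: mean_embedding(v) for k, v in next_tracklets.items()}
    # Score every cross-clip pair, then accept the best pairs first.
    pairs = sorted(
        ((cosine(p, n), pid, nid)
         for pid, p in prev_desc.items()
         for nid, n in next_desc.items()),
        reverse=True,
    )
    matches, used_prev = {}, set()
    for score, pid, nid in pairs:
        if score < threshold:
            break  # remaining pairs are even less similar
        if pid in used_prev or nid in matches:
            continue  # each tracklet is matched at most once
        matches[nid] = pid
        used_prev.add(pid)
    return matches
```

Because each descriptor pools several frames, a single occluded or blurred frame affects the match less than it would in a two-frame comparison, which is the intuition the abstract gives for clip-wise matching. A real system would likely replace the greedy step with an optimal assignment and a learned similarity.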
Taxonomy
Topics: Video Surveillance and Tracking Methods · Image and Video Quality Assessment · Human Pose and Action Recognition
Methods: Contrastive Language-Image Pre-training
