Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking
Riku Inoue, Shogo Sato, Kazuhiko Murasaki, Tomoyasu Shimada, Toshihiko Nishimura, Ryuichi Tanida

TL;DR
This paper introduces CUTAL, a clip-level active learning method for end-to-end multi-object tracking that leverages uncertainty and temporal diversity to reduce annotation costs while maintaining high performance.
Contribution
The paper proposes a clip-level active learning approach that matches the multi-frame clips on which modern end-to-end trackers train and run inference, improving annotation efficiency and reducing labeling costs.
Findings
CUTAL outperforms frame-based active learning baselines.
It achieves performance comparable to full supervision with only 50% of the data labeled.
It reduces annotation effort while maintaining tracking accuracy.
Abstract
Multi-Object Tracking (MOT) in dynamic environments relies on robust temporal reasoning to maintain consistent object identities over time. Transformer-based end-to-end MOT models achieve strong performance by explicitly modeling temporal dependencies, yet training them requires extensive bounding-box and identity annotations. Given the high labeling cost and strong redundancy in videos, Active Learning (AL) is an effective approach to improve annotation efficiency. However, existing AL methods for MOT primarily operate at the frame level, which is structurally misaligned with modern end-to-end trackers whose inference and training rely on multi-frame clips. To bridge this gap, we formulate clip-level active learning and propose Clip-level Uncertainty and Temporal-aware Active Learning (CUTAL). In contrast to frame-based approaches, CUTAL scores each clip using uncertainty metrics…
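The abstract is cut off before it names the uncertainty metrics CUTAL uses or explains how temporal diversity enters the selection, so what follows is only a minimal sketch of clip-level selection in general, not the authors' method. It assumes each unlabeled clip already carries hypothetical per-frame uncertainty scores, aggregates them by a simple mean, and enforces temporal spread with a greedy minimum-gap rule. The names `Clip`, `clip_score`, `select_clips`, and `min_gap` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    video_id: str
    start_frame: int
    # Hypothetical per-frame uncertainty scores (assumed to be precomputed
    # by the tracker; the paper's actual metrics are not given in the abstract).
    frame_uncertainties: list[float]


def clip_score(clip: Clip) -> float:
    """Aggregate frame scores into one clip-level score (mean is an assumption)."""
    return sum(clip.frame_uncertainties) / len(clip.frame_uncertainties)


def select_clips(clips: list[Clip], budget: int, min_gap: int) -> list[Clip]:
    """Greedily pick up to `budget` high-uncertainty clips.

    A clip is skipped if it starts fewer than `min_gap` frames away from an
    already chosen clip of the same video, which is a simple stand-in for
    temporal diversity.
    """
    ranked = sorted(clips, key=clip_score, reverse=True)
    chosen: list[Clip] = []
    for clip in ranked:
        if len(chosen) == budget:
            break
        too_close = any(
            c.video_id == clip.video_id
            and abs(c.start_frame - clip.start_frame) < min_gap
            for c in chosen
        )
        if not too_close:
            chosen.append(clip)
    return chosen


if __name__ == "__main__":
    pool = [
        Clip("a", 0, [0.9, 0.8]),
        Clip("a", 5, [0.85, 0.8]),
        Clip("a", 50, [0.3, 0.4]),
        Clip("b", 0, [0.5, 0.5]),
    ]
    for c in select_clips(pool, budget=2, min_gap=10):
        print(c.video_id, c.start_frame)
```

In this toy pool the second clip of video "a" is highly uncertain but nearly overlaps the first, so the gap rule passes over it in favor of a clip from another video. That is the kind of redundancy a purely score-ranked, frame-level selection would not avoid.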
