MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking
En Yu, Tiancai Wang, Zhuoling Li, Yuang Zhang, Xiangyu Zhang, Wenbing Tao

TL;DR
MOTRv3 introduces a novel release-fetch supervision strategy that balances label assignment in end-to-end multi-object tracking, improving performance without extra detection networks during inference.
Contribution
This work reveals the label assignment conflict in end-to-end MOT and proposes MOTRv3 with release-fetch supervision to address it, enhancing tracking accuracy.
Findings
Achieves state-of-the-art results on MOT17 and DanceTrack benchmarks.
Eliminates the need for an additional detection network during inference.
Improves convergence dynamics in end-to-end multi-object tracking.
Abstract
Although end-to-end multi-object trackers like MOTR enjoy the merits of simplicity, they suffer seriously from the conflict between detection and association, resulting in unsatisfactory convergence dynamics. While MOTRv2 partly addresses this problem, it requires an additional detection network for assistance. In this work, we are the first to reveal that this conflict arises from the unfair label assignment between detect queries and track queries during training, where detect queries recognize targets and track queries associate them. Based on this observation, we propose MOTRv3, which balances the label assignment process using the developed release-fetch supervision strategy. In this strategy, labels are first released for detection and gradually fetched back for association. Besides, another two strategies named pseudo label distillation and track group denoising are…
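The release-then-fetch idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the linear schedule, the function names `release_ratio` and `split_labels`, and the choice to schedule over training steps are all assumptions made for illustration. The sketch only shows how labels of already-tracked objects could first be handed to detect queries and then gradually returned to track queries.

```python
from typing import List, Tuple


def release_ratio(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule (an assumption, not from the paper).

    Returns 1.0 when every tracked-object label is released to the detect
    queries and 0.0 when all of them have been fetched back for association.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress


def split_labels(
    tracked_ids: List[int], newborn_ids: List[int], ratio: float
) -> Tuple[List[int], List[int]]:
    """Split ground-truth IDs between detect queries and track queries.

    Newborn objects always supervise detect queries. A `ratio` share of the
    already-tracked objects is released to detect queries as well; the rest
    stays with the track queries.
    """
    n_released = int(len(tracked_ids) * ratio)
    detect_labels = newborn_ids + tracked_ids[:n_released]
    track_labels = tracked_ids[n_released:]
    return detect_labels, track_labels


if __name__ == "__main__":
    tracked, newborn = [1, 2, 3, 4], [5]
    for step in (0, 5, 10):
        ratio = release_ratio(step, total_steps=10)
        print(step, ratio, split_labels(tracked, newborn, ratio))
```

With `ratio = 0.0` the split reduces to the imbalanced case the abstract describes, where detect queries are supervised only by newly appearing objects; larger ratios give detect queries more labels early in training.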
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The paper is well-written and clearly organized.
2. The observation regarding the disproportional assignment of track and detection queries is insightful, with a clear analysis that adds valuable understanding for the MOT community.
3. The proposed Release-Fetch Supervision (RFS) method is simple yet effective. Due to its simplicity, it could potentially benefit various Transformer-based MOT methods.
1. Limited Scenarios for RFS Technique: The need for RFS is primarily in scenarios where there is more video data than image data for training, leading to a disproportionate assignment of track and detection queries. However, in most cases, such as in large-scale [1] and open-vocabulary MOT [2] tasks, the opposite is true, with more image detection data than video tracking data. In these cases, joint training with both image and tracking data is a common practice, providing sufficient supervisio
Compared to MOTRv2, MOTRv3 returns to a fully end-to-end design, eliminating the dependence on external detection models. This improves the model’s integration and inference efficiency, making it suitable for real-time applications.
1. Neither v1 nor v2 utilized pseudo label distillation. In v3, Pseudo Label Distillation (PLD) enhances the training of detection queries by using high-quality pseudo labels, further improving detection accuracy. While effective, this approach does not alter the fundamental framework.
2. v3 introduces Track Group Denoising (TGD), which enhances the stability of tracking queries, making the model more robust in dynamic scenes and reducing issues like ID switching. TGD can be seen as an extension
- The finding about the detection/association conflict provided in Fig. 1 is somewhat interesting.
- The novelty is relatively limited and incremental:
  + The improvement can be simply interpreted as adding a lot more pseudo-GT from the TGD components into both detection and association stages from the 1st to 5th decoders, named RFS.
  + The TGD component changes the assignment between track query groups and GT.
- The writing is not clear, as many technical details are explained in plain text rather than algorithms or equations, which compromises readability and reproducibility.
- The nume
1. The idea to alleviate the label imbalance between the optimization of detection and association subtasks in MOT is sound and appealing.
2. The paper is well-organized with theoretical analysis, motivation, and methodology. Also, comprehensive experimental evaluations and comparisons are provided.
3. The proposed RFS strategy is effective with sufficient explanation. And the good overall performance of the proposed MOTRv3 is inspiring for end-to-end MOT.
1. For evaluation results on DanceTrack, the performance improvement is very limited compared to MOTRv2.
2. The experiments are not fair and are unconvincing, since MOTRv3 uses ConvNeXt-Base as its backbone while the others use ResNet50.
3. The authors should further discuss the label assignment of previous object detection and tracking methods, such as: [1] Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment, ICCV 2023 [2] DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR, ICLR 20
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Chemical Sensor Technologies · Domain Adaptation and Few-Shot Learning
