DINO-CoDT: Multi-class Collaborative Detection and Tracking with Vision Foundation Models

Xunjie He; Christina Dao Wen Lee; Meiling Wang; Chengran Yuan; Zefan Huang; Yufeng Yue; Marcelo H. Ang Jr

arXiv:2506.07375 · cs.CV · June 10, 2025

Open Access

TL;DR

This paper introduces DINO-CoDT, a multi-class collaborative detection and tracking framework built on vision foundation models. Its novel modules for feature learning, re-identification, and adaptive track management significantly improve accuracy in diverse road scenarios.

Contribution

The paper presents a new multi-class collaborative detection and tracking framework with modules for feature fusion, re-identification, and adaptive tracking, addressing the limitations of prior methods that focus on the vehicle superclass.

Findings

01 Outperforms state-of-the-art in detection accuracy

02 Reduces ID switch errors effectively

03 Enhances tracking robustness across diverse object classes

Abstract

Collaborative perception, which primarily involves collaborative 3D detection and tracking, plays a crucial role in enhancing environmental understanding by expanding the perceptual range and improving robustness against sensor failures. Detection focuses on object recognition in individual frames, while tracking captures continuous instance tracklets over time. However, existing works in both areas predominantly focus on the vehicle superclass and lack effective solutions for multi-class collaborative detection and tracking. This limitation hinders their applicability in real-world scenarios, which involve diverse object classes with varying appearances and motion patterns. To overcome these limitations, we propose a multi-class collaborative detection and tracking framework tailored for diverse road users. We first present a detector with a global spatial attention fusion…

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Autonomous Vehicle Technology and Safety