DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking
Sijia Chen, Lijuan Ma, Yanqiu Yu, En Yu, Liman Liu, Wenbing Tao

TL;DR
This paper introduces DRMOT, a 3D-aware referring multi-object tracking task that fuses RGB, depth, and language inputs, together with a dataset (DRSet) and a framework (DRTrack) aimed at improving spatial-semantic grounding and tracking accuracy.
Contribution
The paper proposes the DRMOT task, builds the DRSet dataset of RGB, depth, and language (RGB-D-L) data, and develops DRTrack, a depth-aware tracking framework guided by multi-modal large language models.
Findings
DRTrack outperforms existing methods on DRSet.
Depth information significantly improves tracking accuracy.
The dataset enables evaluation of spatial-semantic grounding in 3D-aware tracking.
Abstract
Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions and is vital for interactive AI systems such as robotics and autonomous driving. However, existing RMOT models rely solely on 2D RGB data, making it challenging to accurately detect and associate targets characterized by complex spatial semantics (e.g., "the person closest to the camera") and to maintain reliable identities under severe occlusion, due to the absence of explicit 3D spatial information. In this work, we propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking. To advance research on the DRMOT task, we construct a tailored RGBD referring multi-object tracking dataset, named DRSet, designed to evaluate models' spatial-semantic grounding and tracking…
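As a rough illustration of why explicit depth helps with an expression such as "the person closest to the camera", the sketch below picks the nearest target by comparing the median depth inside each bounding box. This is not DRTrack or any method from the paper; the function names (`median_depth`, `closest_track`), the box format, the toy depth map, and the convention that a depth of 0 means "no reading" are all assumptions made for the example.

```python
from statistics import median


def median_depth(depth_map, box):
    """Median valid depth inside a box (x1, y1, x2, y2); x2 and y2 are exclusive.

    Assumption: a depth value of 0 marks a missing sensor reading.
    Returns infinity when the box contains no valid depth at all.
    """
    x1, y1, x2, y2 = box
    values = [
        depth_map[y][x]
        for y in range(y1, y2)
        for x in range(x1, x2)
        if depth_map[y][x] > 0
    ]
    return median(values) if values else float("inf")


def closest_track(depth_map, boxes):
    """Return the (hypothetical) track ID whose box has the smallest median depth."""
    return min(boxes, key=lambda track_id: median_depth(depth_map, boxes[track_id]))


# Toy 4x6 depth map in metres: left half is far (5 m), right half is near (2 m),
# with one missing reading (0.0) in the near region.
depth_map = [
    [5.0, 5.0, 5.0, 2.0, 2.0, 2.0],
    [5.0, 5.0, 5.0, 2.0, 0.0, 2.0],
    [5.0, 5.0, 5.0, 2.0, 2.0, 2.0],
    [5.0, 5.0, 5.0, 2.0, 2.0, 2.0],
]

# Two tracked targets with made-up IDs and boxes.
boxes = {1: (0, 0, 3, 4), 2: (3, 0, 6, 4)}

nearest_id = closest_track(depth_map, boxes)
```

With RGB alone, "closest" has to be guessed from cues such as box size or vertical position in the frame; a depth channel turns it into a direct comparison. The median is used here only because it tolerates a few missing or noisy depth pixels better than the mean.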
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods · Human Pose and Action Recognition
