DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking

Sijia Chen; Lijuan Ma; Yanqiu Yu; En Yu; Liman Liu; Wenbing Tao

arXiv:2602.04692·cs.CV·February 9, 2026

Open Access

TL;DR

This paper introduces DRMOT, a 3D-aware referring multi-object tracking task that fuses RGB, depth, and language inputs, together with the DRSet dataset and the DRTrack framework, to improve spatial-semantic grounding and tracking accuracy.

Contribution

It proposes the DRMOT task, creates the DRSet dataset with RGB-D-L data, and develops DRTrack, a depth-aware tracking framework guided by multi-modal large language models.

Findings

01. DRTrack outperforms existing methods on DRSet.

02. Depth information significantly improves tracking accuracy.

03. The dataset enables evaluation of spatial-semantic grounding in 3D-aware tracking.

Abstract

Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions and is vital for interactive AI systems such as robotics and autonomous driving. However, existing RMOT models rely solely on 2D RGB data, making it challenging to accurately detect and associate targets characterized by complex spatial semantics (e.g., "the person closest to the camera") and to maintain reliable identities under severe occlusion, due to the absence of explicit 3D spatial information. In this work, we propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking. To advance research on the DRMOT task, we construct a tailored RGBD referring multi-object tracking dataset, named DRSet, designed to evaluate models' spatial-semantic grounding and tracking…
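
To illustrate why a depth channel helps with an expression such as "the person closest to the camera", the following is a minimal sketch. It is not the paper's DRTrack method. It assumes a depth map in metres aligned with the RGB frame, zeros marking missing readings, and boxes given as (x1, y1, x2, y2); the function names `box_depth` and `closest_to_camera` are hypothetical.

```python
from statistics import median


def box_depth(depth_map, box):
    """Median valid depth inside a box (x1, y1, x2, y2); zeros count as missing."""
    x1, y1, x2, y2 = box
    values = [d for row in depth_map[y1:y2] for d in row[x1:x2] if d > 0]
    # A box with no valid depth reading is treated as infinitely far away.
    return median(values) if values else float("inf")


def closest_to_camera(depth_map, boxes):
    """Return the track ID whose box has the smallest median depth."""
    return min(boxes, key=lambda track_id: box_depth(depth_map, boxes[track_id]))


# Toy 4x6 depth map (assumed data): left half at 2.0 m, right half at 5.0 m.
depth_map = [[2.0, 2.0, 2.0, 5.0, 5.0, 5.0] for _ in range(4)]
boxes = {1: (0, 0, 3, 4), 2: (3, 0, 6, 4)}  # track ID -> bounding box

nearest = closest_to_camera(depth_map, boxes)
print(nearest)
```

With RGB alone, the two boxes in this toy frame have the same size and give no cue about distance; the depth values make the ordering explicit. The median is used here only because it tolerates a few background pixels inside a box.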

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods · Human Pose and Action Recognition