QTrack: Query-Driven Reasoning for Multi-modal MOT
Tajamul Ashraf, Tavaheed Tariq, Sonia Yadav, Abrar Ul Riyaz, Wasif Tak, Moloud Abdar, Janibul Bashir

TL;DR
QTrack introduces a novel query-driven multi-object tracking framework that localizes and tracks objects based on natural language queries, supported by a new benchmark and a multimodal reasoning model.
Contribution
The paper proposes a new paradigm for language-guided tracking, a large-scale benchmark (RMOT26), and an end-to-end model (QTrack) with motion-aware reasoning strategies.
Findings
QTrack outperforms existing methods on the RMOT26 benchmark.
The approach effectively localizes and tracks objects based on natural language queries.
The Temporal Perception-Aware Policy improves reasoning accuracy.
Abstract
Multi-object tracking (MOT) has traditionally focused on estimating trajectories of all objects in a video, without selectively reasoning about user-specified targets under semantic instructions. In this work, we introduce a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries. Given a reference frame, a video sequence, and a textual query, the goal is to localize and track only the target(s) specified in the query while maintaining temporal coherence and identity consistency. To support this setting, we construct RMOT26, a large-scale benchmark with grounded queries and sequence-level splits to prevent identity leakage and enable robust evaluation of generalization. We further present QTrack, an end-to-end vision-language model that integrates multimodal reasoning with tracking-oriented localization.…
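The sequence-level splitting mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the sample layout (dicts with `sequence_id`, `query`, and `track_id` keys), the function name `split_by_sequence`, and the 20% test fraction are all assumptions made for illustration. The idea shown is that whole video sequences, rather than individual frames or queries, are assigned to one split, so no tracked identity can appear in both training and test data.

```python
import random
from collections import defaultdict


def split_by_sequence(samples, test_fraction=0.2, seed=0):
    """Assign whole sequences to train or test (hypothetical sample layout).

    Each sample is assumed to be a dict with a 'sequence_id' key; every
    sample from a given sequence ends up in the same split.
    """
    # Group samples by the video sequence they come from.
    by_seq = defaultdict(list)
    for sample in samples:
        by_seq[sample["sequence_id"]].append(sample)

    # Shuffle sequence ids deterministically, then carve off the test set.
    seq_ids = sorted(by_seq)
    random.Random(seed).shuffle(seq_ids)
    n_test = max(1, round(len(seq_ids) * test_fraction))
    test_ids = set(seq_ids[:n_test])

    train = [s for sid in seq_ids if sid not in test_ids for s in by_seq[sid]]
    test = [s for sid in seq_ids if sid in test_ids for s in by_seq[sid]]
    return train, test


if __name__ == "__main__":
    # Toy data: 5 sequences with 3 annotated targets each.
    toy = [
        {"sequence_id": f"seq{i}", "query": "the person in red", "track_id": j}
        for i in range(5)
        for j in range(3)
    ]
    train, test = split_by_sequence(toy)
    train_seqs = {s["sequence_id"] for s in train}
    test_seqs = {s["sequence_id"] for s in test}
    print(len(train), len(test), train_seqs.isdisjoint(test_seqs))
```

Splitting at the frame or query level instead would let the same object identity appear on both sides of the split, which is the leakage the abstract says the benchmark is designed to avoid.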
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods · Human Pose and Action Recognition
