QTrack: Query-Driven Reasoning for Multi-modal MOT

Tajamul Ashraf; Tavaheed Tariq; Sonia Yadav; Abrar Ul Riyaz; Wasif Tak; Moloud Abdar; Janibul Bashir

arXiv:2603.13759 · cs.CV · March 17, 2026

TL;DR

QTrack introduces a novel query-driven multi-object tracking framework that localizes and tracks objects based on natural language queries, supported by a new benchmark and a multimodal reasoning model.

Contribution

The paper proposes a new paradigm for language-guided tracking, a large-scale benchmark RMOT26, and an end-to-end model QTrack with motion-aware reasoning strategies.

Findings

01 QTrack outperforms existing methods on the RMOT26 benchmark.

02 The approach effectively localizes and tracks objects based on natural language queries.

03 The Temporal Perception-Aware Policy improves reasoning accuracy.

Abstract

Multi-object tracking (MOT) has traditionally focused on estimating trajectories of all objects in a video, without selectively reasoning about user-specified targets under semantic instructions. In this work, we introduce a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries. Given a reference frame, a video sequence, and a textual query, the goal is to localize and track only the target(s) specified in the query while maintaining temporal coherence and identity consistency. To support this setting, we construct RMOT26, a large-scale benchmark with grounded queries and sequence-level splits to prevent identity leakage and enable robust evaluation of generalization. We further present QTrack, an end-to-end vision-language model that integrates multimodal reasoning with tracking-oriented localization.…
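The sequence-level split mentioned in the abstract can be illustrated with a minimal sketch. It assigns every sample from a given video sequence to the same split, so no sequence (and therefore no track identity inside it) appears in both training and validation. The field names (`sequence_id`, `query`, `track_ids`), the function name, and the 20% validation fraction are assumptions made for illustration; they are not the RMOT26 schema or the authors' procedure.

```python
import random


def sequence_level_split(samples, val_fraction=0.2, seed=0):
    """Split query samples so that each video sequence lands in exactly one split.

    `samples` is a list of dicts with a hypothetical "sequence_id" key.
    Returns (train_samples, val_samples).
    """
    # Collect unique sequence ids; sort first so the shuffle is reproducible.
    seq_ids = sorted({s["sequence_id"] for s in samples})
    random.Random(seed).shuffle(seq_ids)

    # Hold out at least one sequence whenever any sequences exist.
    n_val = max(1, round(len(seq_ids) * val_fraction)) if seq_ids else 0
    val_ids = set(seq_ids[:n_val])

    # Route every sample by its sequence, never by the individual query.
    train = [s for s in samples if s["sequence_id"] not in val_ids]
    val = [s for s in samples if s["sequence_id"] in val_ids]
    return train, val


if __name__ == "__main__":
    # Hypothetical samples: two queries per sequence.
    demo = [
        {"sequence_id": f"seq_{i}", "query": f"query {j}", "track_ids": [j]}
        for i in range(5)
        for j in range(2)
    ]
    train, val = sequence_level_split(demo)
    print(len(train), len(val))
```

Splitting by sequence rather than by individual query is the design choice that matters here: a per-sample random split would let queries about the same tracked object fall on both sides of the boundary.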

Code & Models

Models

🤗 GAASH-Lab/QTrack · model

Datasets

GAASH-Lab/RMOT26 · dataset · 8.7k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods · Human Pose and Action Recognition