Vision-Motion-Reference Alignment for Referring Multi-Object Tracking via Multi-Modal Large Language Models
Weiyi Lv, Ning Zhang, Hanyang Sun, Haoran Jiang, Kai Zhao, Jing Xiao, Dan Zeng

TL;DR
This paper introduces VMRMOT, a novel framework that enhances referring multi-object tracking by integrating motion modalities and leveraging multi-modal large language models for improved vision-reference alignment.
Contribution
It presents the first use of multi-modal large language models in RMOT, incorporating motion-aware descriptions and a hierarchical alignment module to improve tracking accuracy.
Findings
VMRMOT outperforms existing state-of-the-art RMOT methods on multiple benchmarks.
It captures dynamic object motion, which improves temporal consistency.
Its Vision-Motion-Reference Alignment module strengthens multi-modal alignment.
Abstract
Referring Multi-Object Tracking (RMOT) extends conventional multi-object tracking (MOT) by introducing natural language references for multi-modal fusion tracking. RMOT benchmarks only describe the object's appearance, relative positions, and initial motion states. This so-called static regulation fails to capture dynamic changes in object motion, including velocity changes and motion direction shifts. This limitation not only causes a temporal discrepancy between static references and the dynamic vision modality but also constrains multi-modal tracking performance. To address this limitation, we propose a novel Vision-Motion-Reference aligned RMOT framework, named VMRMOT. It integrates a motion modality extracted from object dynamics to enhance the alignment between the vision modality and language references through multi-modal large language models (MLLMs). Specifically, we introduce…
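To make the idea of a motion modality concrete, the sketch below turns a short track of box centers into a textual cue about velocity change and direction shift, the two dynamics the abstract names. It is an illustrative assumption, not the VMRMOT pipeline: the function name `describe_motion`, the early/late split of the track, and both thresholds are hypothetical choices made for this example only.

```python
import math


def describe_motion(centers, speed_tol=0.2, turn_tol_deg=30.0, eps=1e-6):
    """Summarize a track as text (hypothetical helper, not from the paper).

    centers: list of (x, y) box centers, one per frame, at least 3 entries.
    speed_tol: relative speed change treated as acceleration/deceleration.
    turn_tol_deg: heading change (degrees) treated as a direction shift.
    """
    if len(centers) < 3:
        raise ValueError("need at least 3 centers to compare early and late motion")

    # Per-frame displacement vectors between consecutive centers.
    steps = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(centers, centers[1:])]
    half = len(steps) // 2
    early, late = steps[:half], steps[half:]

    def mean_vec(vs):
        return (sum(v[0] for v in vs) / len(vs), sum(v[1] for v in vs) / len(vs))

    (ex, ey), (lx, ly) = mean_vec(early), mean_vec(late)
    s_early, s_late = math.hypot(ex, ey), math.hypot(lx, ly)

    if s_early < eps and s_late < eps:
        return "stationary"

    # Relative speed change between the early and late halves of the track.
    rel = (s_late - s_early) / max(s_early, eps)
    if rel > speed_tol:
        parts = ["accelerating"]
    elif rel < -speed_tol:
        parts = ["decelerating"]
    else:
        parts = ["constant speed"]

    # Heading is only meaningful when the object moves in both halves.
    if s_early >= eps and s_late >= eps:
        a_early = math.degrees(math.atan2(ey, ex))
        a_late = math.degrees(math.atan2(ly, lx))
        diff = (a_late - a_early + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
        parts.append("changing direction" if abs(diff) > turn_tol_deg else "keeping direction")

    return ", ".join(parts)
```

For example, `describe_motion([(0, 0), (1, 0), (2, 0), (4, 0), (6, 0)])` yields a cue about acceleration along a fixed heading, which a static reference written for the first frame could not express.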
Taxonomy
Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Gaze Tracking and Assistive Technology
