Vision-Motion-Reference Alignment for Referring Multi-Object Tracking via Multi-Modal Large Language Models

Weiyi Lv; Ning Zhang; Hanyang Sun; Haoran Jiang; Kai Zhao; Jing Xiao; Dan Zeng

arXiv:2511.17681·cs.CV·November 25, 2025

Open Access

TL;DR

This paper introduces VMRMOT, a novel framework that enhances referring multi-object tracking by integrating motion modalities and leveraging multi-modal large language models for improved vision-reference alignment.

Contribution

The paper presents the first use of multi-modal large language models (MLLMs) in RMOT, adding motion-aware descriptions and a hierarchical alignment module to improve tracking accuracy.

Findings

01. Outperforms existing state-of-the-art RMOT methods on multiple benchmarks.

02. Effectively captures dynamic object motions to improve temporal consistency.

03. Enhances multi-modal alignment through a novel Vision-Motion-Reference Alignment module.

Abstract

Referring Multi-Object Tracking (RMOT) extends conventional multi-object tracking (MOT) by introducing natural language references for multi-modal fusion tracking. RMOT benchmarks only describe the object's appearance, relative positions, and initial motion states. This so-called static regulation fails to capture dynamic changes of the object motion, including velocity changes and motion direction shifts. This limitation not only causes a temporal discrepancy between static references and dynamic vision modality but also constrains multi-modal tracking performance. To address this limitation, we propose a novel Vision-Motion-Reference aligned RMOT framework, named VMRMOT. It integrates a motion modality extracted from object dynamics to enhance the alignment between vision modality and language references through multi-modal large language models (MLLMs). Specifically, we introduce…
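
To make the idea of a motion modality more concrete, the sketch below turns a track's box centers into a short text phrase covering the two dynamics the abstract names: velocity changes and motion direction shifts. It is an illustration only, not the authors' method. The function name `describe_motion`, the thresholds `speed_tol` and `turn_tol_deg`, and the early-half versus late-half comparison are all assumptions made for this example.

```python
import math


def describe_motion(centers, speed_tol=0.2, turn_tol_deg=15.0):
    """Summarize a track's dynamics as a short phrase (hypothetical helper).

    centers: list of (x, y) box centers, one per frame.
    speed_tol: relative speed change treated as a velocity change (assumed).
    turn_tol_deg: heading change in degrees treated as a direction shift (assumed).
    """
    if len(centers) < 3:
        return "insufficient motion history"

    # Per-frame displacement vectors between consecutive centers.
    steps = [(x1 - x0, y1 - y0)
             for (x0, y0), (x1, y1) in zip(centers, centers[1:])]

    def mean_vec(vectors):
        n = len(vectors)
        return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)

    # Compare the average displacement of the early half with the late half.
    half = len(steps) // 2
    early, late = mean_vec(steps[:half]), mean_vec(steps[half:])
    s0, s1 = math.hypot(*early), math.hypot(*late)

    eps = 1e-6
    if s0 < eps and s1 < eps:
        return "stationary"

    # Velocity change: relative difference in average speed.
    rel = (s1 - s0) / max(s0, eps)
    if rel > speed_tol:
        speed = "accelerating"
    elif rel < -speed_tol:
        speed = "decelerating"
    else:
        speed = "steady speed"

    # Direction shift: unsigned angle between early and late displacement.
    if s0 < eps or s1 < eps:
        heading = "direction undefined"
    else:
        cross = early[0] * late[1] - early[1] * late[0]
        dot = early[0] * late[0] + early[1] * late[1]
        turn = abs(math.degrees(math.atan2(cross, dot)))
        heading = "changing direction" if turn > turn_tol_deg else "keeping direction"

    return f"{speed}, {heading}"


print(describe_motion([(0, 0), (1, 0), (2, 0), (4, 0), (6, 0)]))
print(describe_motion([(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]))
```

A phrase like this could be recomputed as the track evolves, whereas a static reference is fixed for the whole sequence. How VMRMOT actually extracts and encodes motion is not specified in the truncated abstract above.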

Taxonomy

Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Gaze Tracking and Assistive Technology