Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge

Yi Yang; Yiming Xu; Timo Kaiser; Hao Cheng; Bodo Rosenhahn; Michael Ying Yang

arXiv:2511.03332 · cs.CV · November 6, 2025

TL;DR

This paper introduces a training-free, two-stage zero-shot method combining FastTracker and LLaVA-Video to localize and track objects based on language queries in complex videos, achieving competitive results in the MOT25-StAG challenge.

Contribution

The paper presents a novel training-free, zero-shot approach for multi-object tracking with language queries using a combination of existing models, addressing complex real-world scenes.

Findings

01. Achieved second place in the MOT25-StAG challenge.

02. Attained m-HIoU of 20.68 and HOTA of 10.73 on the test set.

03. Demonstrated effectiveness of a training-free, multi-modal retrieval approach.

Abstract

In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The challenge asks participants to accurately localize and track multiple objects that match specific, free-form language queries in videos of complex real-world scenes. We model the task as a video retrieval problem and present a two-stage, zero-shot approach that combines the strengths of the state-of-the-art tracking model FastTracker and the multimodal large language model LLaVA-Video. On the MOT25-StAG test set, our method achieves an m-HIoU of 20.68 and a HOTA of 10.73, placing second in the challenge.
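
The two-stage, track-then-retrieve idea described in the abstract can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how FastTracker output is passed to LLaVA-Video or how a match is decided. Here, stage 1 is assumed to have already produced `Track` objects, the model judgment is replaced by a hypothetical `score_fn` callable, and `threshold`, `retrieve_tracks`, and `make_toy_scorer` are names invented for this example. No real FastTracker or LLaVA-Video API is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


@dataclass
class Track:
    """One tracked object: assumed output format of stage 1 (the tracker)."""
    track_id: int
    boxes: Dict[int, Box]  # frame index -> bounding box


def retrieve_tracks(
    tracks: List[Track],
    query: str,
    score_fn: Callable[[Track, str], float],
    threshold: float = 0.5,
) -> List[Track]:
    """Stage 2 (retrieval): keep every track whose score for the query
    reaches the threshold. score_fn stands in for a multimodal model that
    judges whether a track matches the language query (an assumption)."""
    return [t for t in tracks if score_fn(t, query) >= threshold]


def make_toy_scorer(scores: Dict[int, float]) -> Callable[[Track, str], float]:
    """Build a placeholder scorer from fixed per-track scores.
    A real system would query a video-language model here instead."""
    def score(track: Track, query: str) -> float:
        return scores.get(track.track_id, 0.0)
    return score


# Toy usage: three tracks from a pretend tracker, scored against one query.
tracks = [
    Track(1, {0: (10.0, 10.0, 50.0, 80.0)}),
    Track(2, {0: (60.0, 12.0, 90.0, 85.0)}),
    Track(3, {1: (15.0, 11.0, 55.0, 82.0)}),
]
scorer = make_toy_scorer({1: 0.9, 2: 0.2, 3: 0.5})
matched = retrieve_tracks(tracks, "person crossing the street", scorer)
matched_ids = [t.track_id for t in matched]
```

Passing the scorer in as a callable keeps the retrieval step independent of any particular model, which mirrors the training-free framing: both stages reuse existing components and only the glue logic is new.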

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization