Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge
Yi Yang, Yiming Xu, Timo Kaiser, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang

TL;DR
This paper introduces a training-free, two-stage zero-shot method combining FastTracker and LLaVA-Video to localize and track objects based on language queries in complex videos, achieving competitive results in the MOT25-StAG challenge.
Contribution
The paper presents a training-free, zero-shot approach to multi-object tracking with language queries that combines two existing models, FastTracker and LLaVA-Video, and targets complex real-world scenes.
Findings
Achieved second place in the MOT25-StAG challenge.
Attained m-HIoU of 20.68 and HOTA of 10.73 on the test set.
Demonstrated effectiveness of a training-free, multi-modal retrieval approach.
Abstract
In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately localize and track multiple objects that match specific, free-form language queries, using video data of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach that combines the advantages of the SOTA tracking model FastTracker and the Multi-modal Large Language Model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73, respectively, placing second in the challenge.
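The two-stage, retrieval-style formulation in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the summary gives no interface details, so `Track`, `two_stage_grounding`, `run_tracker`, and `matches_query` are hypothetical names, and the toy tracker and matcher below merely stand in for FastTracker and LLaVA-Video. The only idea taken from the text is the structure: first obtain query-agnostic tracks, then keep the tracks that a language-aware model judges to match the query.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


@dataclass
class Track:
    """One tracked object: a hypothetical container, not a FastTracker type."""
    track_id: int
    boxes: Dict[int, Box]  # frame index -> bounding box


def two_stage_grounding(
    frames: Sequence[object],
    query: str,
    run_tracker: Callable[[Sequence[object]], List[Track]],
    matches_query: Callable[[Track, str], bool],
) -> List[Track]:
    """Stage 1: track every object. Stage 2: retrieve tracks matching the query."""
    tracks = run_tracker(frames)  # query-agnostic tracking
    return [t for t in tracks if matches_query(t, query)]  # retrieval by query


# Toy stand-ins (assumptions) so the sketch runs without any model weights.
def toy_tracker(frames: Sequence[object]) -> List[Track]:
    n = len(frames)
    return [
        Track(1, {i: (0.0, 0.0, 10.0, 10.0) for i in range(n)}),
        Track(2, {i: (20.0, 20.0, 30.0, 30.0) for i in range(n)}),
    ]


TOY_LABELS = {1: "person riding a bike", 2: "person standing"}


def toy_matcher(track: Track, query: str) -> bool:
    # A real system would ask a multi-modal LLM; here we compare fixed labels.
    return TOY_LABELS[track.track_id] == query


selected = two_stage_grounding(
    frames=[None, None, None],
    query="person riding a bike",
    run_tracker=toy_tracker,
    matches_query=toy_matcher,
)
selected_ids = [t.track_id for t in selected]
```

Keeping the tracker and the matcher as injected callables mirrors the training-free setting: either component can be swapped out without retraining the other.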
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
