LaMOT: Language-Guided Multi-Object Tracking
Yunhao Li, Xiaoqiong Liu, Luke Liu, Heng Fan, Libo Zhang

TL;DR
This paper introduces LaMOT, a new benchmark and framework for vision-language multi-object tracking, enabling tracking based on natural language commands and providing a standardized platform for evaluation.
Contribution
It presents the first large-scale benchmark, LaMOT, and a simple tracker, LaMOTer, to advance research in language-guided multi-object tracking.
Findings
The LaMOT benchmark includes 1,660 sequences from 4 datasets.
It provides a unified evaluation platform for Vision-Language MOT.
The paper also introduces LaMOTer, an effective baseline tracker.
Abstract
Vision-Language MOT is a crucial tracking problem and has drawn increasing attention recently. It aims to track objects based on human language commands, replacing the traditional use of templates or pre-set information from training sets in conventional tracking tasks. Despite various efforts, a key challenge lies in the lack of a clear understanding of why language is used for tracking, which hinders further development in this field. In this paper, we address this challenge by introducing Language-Guided MOT, a unified task framework, along with a corresponding large-scale benchmark, termed LaMOT, which encompasses diverse scenarios and language descriptions. Specifically, LaMOT comprises 1,660 sequences from 4 different datasets and aims to unify various Vision-Language MOT tasks while providing a standardized evaluation platform. To ensure high-quality annotations, we manually assign…
Taxonomy
Topics: Video Surveillance and Tracking Methods
