AR-MOT: Autoregressive Multi-object Tracking
Lianjie Jia, Yuhan Wu, Binghao Ran, Yifan Wang, Lijun Wang, Huchuan Lu

TL;DR
AR-MOT recasts multi-object tracking as autoregressive sequence generation within a large language model framework, which makes tracking flexible, task-agnostic, and extensible without task-specific architectures.
Contribution
The paper proposes a novel autoregressive paradigm for MOT that eliminates fixed output heads, incorporates region-aware alignment, and supports long-term tracking through sequence-based modeling.
Findings
Achieves performance comparable to state-of-the-art methods on MOT17 and DanceTrack.
Demonstrates flexible integration of new modalities and instructions.
Validates the effectiveness of the autoregressive approach for general MOT tasks.
Abstract
As multi-object tracking (MOT) tasks continue to evolve toward more general and multi-modal scenarios, the rigid and task-specific architectures of existing MOT methods increasingly hinder their applicability across diverse tasks and limit flexibility in adapting to new tracking formulations. Most approaches rely on fixed output heads and bespoke tracking pipelines, making them difficult to extend to more complex or instruction-driven tasks. To address these limitations, we propose AR-MOT, a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector. To mitigate the…
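To make the sequence-generation formulation described in the abstract more concrete, the following is a minimal sketch of how per-frame tracking results could be serialized into a flat token sequence and parsed back. It is not the authors' method: the token vocabulary (`<frame>`, `<id_k>`, `<bin_k>`), the `NUM_BINS` quantization resolution, and the `TrackedBox` type are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical quantization resolution; the paper's actual choice is not given here.
NUM_BINS = 1000


@dataclass(frozen=True)
class TrackedBox:
    """One tracked object in a frame; coordinates are normalized to [0, 1]."""
    track_id: int
    x1: float
    y1: float
    x2: float
    y2: float


def quantize(value: float) -> int:
    """Map a normalized coordinate to an integer bin in [0, NUM_BINS - 1]."""
    clamped = min(max(value, 0.0), 1.0)
    return min(int(clamped * NUM_BINS), NUM_BINS - 1)


def dequantize(bin_index: int) -> float:
    """Map a bin index back to the center of its coordinate interval."""
    return (bin_index + 0.5) / NUM_BINS


def encode_frame(boxes: List[TrackedBox]) -> List[str]:
    """Serialize one frame as: <frame> (<id> <bin> <bin> <bin> <bin>)* </frame>."""
    tokens = ["<frame>"]
    for box in sorted(boxes, key=lambda b: b.track_id):
        tokens.append(f"<id_{box.track_id}>")
        for coord in (box.x1, box.y1, box.x2, box.y2):
            tokens.append(f"<bin_{quantize(coord)}>")
    tokens.append("</frame>")
    return tokens


def decode_frame(tokens: List[str]) -> List[TrackedBox]:
    """Parse a token sequence produced by encode_frame back into boxes."""
    if len(tokens) < 2 or tokens[0] != "<frame>" or tokens[-1] != "</frame>":
        raise ValueError("sequence must be wrapped in <frame> ... </frame>")
    body = tokens[1:-1]
    if len(body) % 5 != 0:
        raise ValueError("each object needs one id token and four bin tokens")
    boxes = []
    for i in range(0, len(body), 5):
        track_id = int(body[i][len("<id_"):-1])
        coords = [dequantize(int(t[len("<bin_"):-1])) for t in body[i + 1:i + 5]]
        boxes.append(TrackedBox(track_id, *coords))
    return boxes
```

Because the output is just a sequence, a format like this can be extended without a new output head, for example by adding further token types per object. A model trained to emit such tokens would be decoded with `decode_frame`; the round trip loses at most one quantization bin of precision per coordinate.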
Taxonomy
Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Gaze Tracking and Assistive Technology
