NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving
Kai Luo, Xu Wang, Rui Fan, Kailun Yang

TL;DR
NOVA introduces a generative, open-vocabulary approach to 3D multi-object tracking in autonomous driving, leveraging language priors and autoregressive models to improve identity consistency and generalization to unknown targets.
Contribution
NOVA reformulates 3D tracking as next-step completion of spatio-temporal semantic sequences using LLMs, enabling open-world perception beyond closed-set limitations.
Findings
Achieves 22.41% AMOTA on nuScenes for novel categories.
Outperforms baseline by 20.21% in accuracy.
Uses a compact 0.5B autoregressive model.
Abstract
Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and "semantic-blind" heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an innovative paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve…
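To make the idea of a "structured spatio-temporal semantic sequence" more concrete, here is a minimal sketch of how a 3D trajectory might be serialized into text for next-step completion. The token layout, the `Box3D` fields, and the helper names (`serialize_track`, `build_prompt`, `parse_step`) are assumptions made for illustration only; the summary above does not specify NOVA's actual sequence format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box3D:
    """One observed step of a track (hypothetical fields, for illustration)."""
    t: int      # frame index
    x: float    # center position, metres
    y: float
    z: float
    yaw: float  # heading, radians


def serialize_track(category: str, boxes: List[Box3D]) -> str:
    """Encode a free-text category label plus the motion history as one string."""
    steps = [f"<t={b.t}> {b.x:.2f} {b.y:.2f} {b.z:.2f} {b.yaw:.2f}" for b in boxes]
    return f"[{category}] " + " ; ".join(steps)


def build_prompt(category: str, boxes: List[Box3D]) -> str:
    """Append an open slot for the next frame, to be completed autoregressively."""
    next_t = boxes[-1].t + 1
    return serialize_track(category, boxes) + f" ; <t={next_t}>"


def parse_step(completion: str) -> Tuple[float, float, float, float]:
    """Read back (x, y, z, yaw) from the text generated for the open slot."""
    x, y, z, yaw = (float(v) for v in completion.split()[:4])
    return (x, y, z, yaw)


history = [Box3D(0, 1.0, 2.0, 0.5, 0.1), Box3D(1, 1.5, 2.0, 0.5, 0.1)]
prompt = build_prompt("stroller", history)
print(prompt)
```

In a sketch like this, the category label is ordinary text, so an unseen class name can be placed in the same slot as a known one; the predicted step parsed by `parse_step` could then be compared against new detections for association. How NOVA actually performs that association is not described in the truncated abstract.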
Taxonomy
Topics: Video Surveillance and Tracking Methods · Autonomous Vehicle Technology and Safety · Human Pose and Action Recognition
