NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving

Kai Luo; Xu Wang; Rui Fan; Kailun Yang

arXiv:2603.06254 · cs.CV · March 9, 2026

TL;DR

NOVA introduces a generative, open-vocabulary approach to 3D multi-object tracking in autonomous driving, leveraging language priors and autoregressive models to improve identity consistency and generalization to unknown targets.

Contribution

It reformulates 3D tracking as spatio-temporal semantic sequence completion using LLMs, enabling open-world perception beyond closed-set limitations.

Findings

01. Achieves 22.41% AMOTA on nuScenes for novel categories.

02. Outperforms baseline by 20.21% in accuracy.

03. Uses a compact 0.5B autoregressive model.

Abstract

Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and "semantic-blind" heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an innovative paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve…
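To make the idea of a "structured spatio-temporal semantic sequence" concrete, here is a minimal sketch of how a track might be serialized into text so that an autoregressive model can complete the next step. It is not the authors' format: the `BoxState` fields, the `category:` prefix, the `<t=k>` step marker, and the function names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BoxState:
    """One observed 3D box centre and heading at time step t (hypothetical layout)."""
    t: int
    x: float
    y: float
    z: float
    yaw: float


def serialize_track(category: str, states: List[BoxState]) -> str:
    """Flatten a track into one text sequence: a free-text category label
    followed by the per-step motion states in temporal order."""
    steps = [
        f"<t={s.t}> x={s.x:.2f} y={s.y:.2f} z={s.z:.2f} yaw={s.yaw:.2f}"
        for s in states
    ]
    return f"category: {category} | " + " ; ".join(steps)


def build_prompt(category: str, states: List[BoxState]) -> str:
    """Append an open marker for the next time step, leaving the state
    itself for an autoregressive model to complete."""
    if not states:
        raise ValueError("a track needs at least one observed state")
    next_t = states[-1].t + 1
    return serialize_track(category, states) + f" ; <t={next_t}>"


if __name__ == "__main__":
    history = [
        BoxState(0, 1.00, 2.00, 0.50, 0.10),
        BoxState(1, 1.50, 2.00, 0.50, 0.10),
    ]
    # The category is plain text, so a label unseen in training can still be encoded.
    print(build_prompt("stroller", history))
```

In this sketch the category stays as free text and is not mapped to a class index, so an unseen label can share the sequence with the motion states. How NOVA actually tokenizes trajectories and associates completions with detections is not specified in the text above.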

Taxonomy

Topics: Video Surveillance and Tracking Methods · Autonomous Vehicle Technology and Safety · Human Pose and Action Recognition