RegTrack: Simplicity Beneath Complexity in Robust Multi-Modal 3D Multi-Object Tracking

Lipeng Gu; Xuefeng Yan; Song Wang; Mingqiang Wei

arXiv:2409.00618 · cs.CV · February 25, 2026

Open Access

TL;DR

RegTrack introduces a simple yet effective multi-modal 3D multi-object tracking method that leverages a unified tri-cue encoder inspired by gauge theory, achieving superior robustness and efficiency without complex association metrics.

Contribution

The paper proposes RegTrack, a 3D MOT approach that pairs a unified tri-cue encoder with pairwise similarity-based association, challenging the assumption that robust tracking requires complex association metrics and class-specific motion priors.
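As a rough illustration of what pairwise similarity-based association can look like, the sketch below matches existing tracks to new detections using only the cosine similarity of their embeddings. It is not the authors' implementation: the greedy matching strategy, the `associate` and `cosine_similarity` helpers, the `min_similarity` threshold, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two embeddings; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def associate(
    tracks: List[Vector],
    detections: List[Vector],
    min_similarity: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[Tuple[int, int]]:
    """Greedily pair tracks with detections by descending similarity.

    Returns (track_index, detection_index) pairs. Each track and each
    detection is used at most once; pairs below min_similarity are dropped.
    """
    scored = [
        (cosine_similarity(t, d), ti, di)
        for ti, t in enumerate(tracks)
        for di, d in enumerate(detections)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)

    used_tracks, used_detections = set(), set()
    matches: List[Tuple[int, int]] = []
    for score, ti, di in scored:
        if score < min_similarity:
            break  # remaining candidates are even less similar
        if ti in used_tracks or di in used_detections:
            continue
        used_tracks.add(ti)
        used_detections.add(di)
        matches.append((ti, di))
    return sorted(matches)


# Toy embeddings standing in for per-object encoder outputs.
tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
matches = associate(tracks, detections)
unmatched = [i for i in range(len(detections)) if i not in {d for _, d in matches}]
```

In this toy case, each track pairs with the detection pointing in nearly the same direction, and the third detection stays unmatched, which a tracker would typically treat as a candidate new object. An optimal assignment solver could replace the greedy loop without changing the similarity metric.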

Findings

01 Outperforms 35 competitors on KITTI and nuScenes datasets.

02 Uses only 2.6 million parameters, demonstrating efficiency.

03 Achieves robust and generalizable tracking with point cloud inputs.

Abstract

Existing 3D multi-object tracking (MOT) methods often sacrifice efficiency and generalizability for robustness, largely relying on complex association metrics derived from multi-modal architectures and class-specific motion priors. Challenging the rooted belief that greater complexity necessarily yields greater robustness, we propose a robust, efficient, and generalizable method for multi-modal 3D MOT, dubbed RegTrack. Inspired by Yang-Mills gauge theory, RegTrack is built upon a unified tri-cue encoder (UTEnc), comprising three tightly coupled components: a local-global point cloud encoder (LG-PEnc), a mixture-of-experts-based geometry encoder (MoE-GEnc), and an image encoder from a well-pretrained visual-language model. LG-PEnc efficiently encodes the spatial and structural information of point clouds to produce foundational representations for each object, whose pairwise similarities…

Taxonomy

Topics: 3D Shape Modeling and Analysis · Human Pose and Action Recognition · Robotics and Sensor-Based Localization

Methods: Contrastive Language-Image Pre-training