BOTT: Box Only Transformer Tracker for 3D Object Tracking
Lubing Zhou, Xiaoli Meng, Yiluan Guo, Jiong Yang

TL;DR
This paper introduces BOTT, a transformer-based method for 3D object tracking that learns to link objects across frames using global box embeddings, reducing engineering complexity and achieving competitive results.
Contribution
BOTT is the first transformer-based approach to directly learn 3D object linking from box data, eliminating the need for handcrafted motion models.
Findings
Achieves 69.9 and 66.7 AMOTA on the nuScenes validation and test splits, respectively.
Attains 56.45 and 59.57 MOTA L2 on the Waymo Open Dataset validation and test splits, respectively.
Seamlessly supports online and offline tracking modes.
Abstract
Tracking 3D objects is an important task in autonomous driving. Classical Kalman Filtering based methods are still the most popular solutions. However, these methods require handcrafted motion models and cannot benefit from growing amounts of data. In this paper, Box Only Transformer Tracker (BOTT) is proposed to learn to link 3D boxes of the same object across different frames, taking all the 3D boxes in a time window as input. Specifically, transformer self-attention is applied to exchange information between all the boxes and learn globally informative box embeddings. The similarity between these learned embeddings can then be used to link the boxes of the same object. BOTT can be used for both online and offline tracking modes seamlessly. Its simplicity significantly reduces the engineering effort required by traditional Kalman Filtering based methods.…
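The pipeline the abstract describes (boxes in a time window, self-attention across all of them, then embedding similarity for linking) can be sketched in a few lines of NumPy. This is only an illustration of the data flow and not the authors' implementation. The 8-value box encoding (x, y, z, length, width, height, yaw, timestamp), the single attention head, the random untrained weights, the cosine similarity, and the greedy matcher `link_boxes` are all assumptions made for the example. With untrained weights the resulting links carry no meaning; in a real system the weights would be learned so that boxes of the same object end up with similar embeddings.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all boxes."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return x + weights @ v  # residual connection

def cosine_similarity(emb):
    """Pairwise cosine similarity between all box embeddings."""
    unit = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    return unit @ unit.T

def link_boxes(sim, frame_ids, threshold=0.0):
    """Hypothetical greedy matcher: link each box to its most similar
    still-unmatched box in the next frame, if similarity >= threshold."""
    links, used = {}, set()
    for i in np.argsort(frame_ids, kind="stable"):
        i = int(i)
        candidates = [j for j in range(len(frame_ids))
                      if frame_ids[j] == frame_ids[i] + 1 and j not in used]
        if not candidates:
            continue
        best = max(candidates, key=lambda j: sim[i, j])
        if sim[i, best] >= threshold:
            links[i] = best
            used.add(best)
    return links

# Toy time window: two objects observed in three frames (six boxes).
# Assumed columns: x, y, z, length, width, height, yaw, timestamp.
boxes = np.array([
    [0.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0, 0.0],    # object A, frame 0
    [20.0, 5.0, 0.0, 4.5, 2.0, 1.6, 1.5, 0.0],   # object B, frame 0
    [1.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0, 0.5],    # object A, frame 1
    [20.0, 6.0, 0.0, 4.5, 2.0, 1.6, 1.5, 0.5],   # object B, frame 1
    [2.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0, 1.0],    # object A, frame 2
    [20.0, 7.0, 0.0, 4.5, 2.0, 1.6, 1.5, 1.0],   # object B, frame 2
])
frame_ids = np.array([0, 0, 1, 1, 2, 2])

rng = np.random.default_rng(0)
dim = 16
w_in = rng.normal(scale=0.1, size=(boxes.shape[1], dim))  # box -> embedding
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))

emb = self_attention(boxes @ w_in, w_q, w_k, w_v)  # untrained weights
sim = cosine_similarity(emb)
links = link_boxes(sim, frame_ids)
print(links)  # maps a box index to its linked box index in the next frame
```

In this sketch the whole window is processed at once. An online variant would presumably slide the window forward as new frames arrive, while an offline variant could cover a longer span of the sequence; the abstract does not spell out these details.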
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Absolute Position Encodings · Residual Connection
