3D Object Tracking with Transformer
Yubo Cui, Zheng Fang, Jiayao Shan, Zuoxu Gu, Sifan Zhou

TL;DR
This paper introduces a transformer-based feature fusion network for 3D object tracking in point clouds, leveraging self- and cross-attention mechanisms to improve similarity computation and achieve state-of-the-art results on KITTI.
Contribution
It presents a novel transformer architecture for feature fusion in 3D object tracking, enhancing similarity computation and tracking accuracy.
Findings
The method achieves state-of-the-art performance on the KITTI dataset.
Self- and cross-attention are used effectively for point cloud feature fusion.
The end-to-end framework simplifies the 3D object tracking pipeline.
Abstract
Feature fusion and similarity computation are two core problems in 3D object tracking, especially for object tracking using sparse and disordered point clouds. Feature fusion could make similarity computing more efficient by including target object information. However, most existing LiDAR-based approaches directly use the extracted point cloud feature to compute similarity while ignoring the attention changes of object regions during tracking. In this paper, we propose a feature fusion network based on transformer architecture. Benefiting from the self-attention mechanism, the transformer encoder captures the inter- and intra- relations among different regions of the point cloud. By using cross-attention, the transformer decoder fuses features and includes more target cues into the current point cloud feature to compute the region attentions, which makes the similarity computing more…
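The encoder/decoder fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' network: it assumes single-head scaled dot-product attention with no learned projections, layer normalization, or feed-forward sublayers. The names `attention`, `fuse`, `template_feat`, and `search_feat` are hypothetical, and the feature matrices are random placeholders standing in for backbone output.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return weights @ v, weights

def fuse(search_feat, template_feat):
    """Hypothetical fusion step (an assumption, not the paper's exact layers)."""
    # Encoder-style self-attention: each point set attends to itself,
    # with a residual connection.
    enc_search = search_feat + attention(search_feat, search_feat, search_feat)[0]
    enc_template = template_feat + attention(template_feat, template_feat, template_feat)[0]
    # Decoder-style cross-attention: queries come from the current (search)
    # point cloud, keys and values from the target template, so target cues
    # are injected into the current point cloud feature.
    cross, weights = attention(enc_search, enc_template, enc_template)
    return enc_search + cross, weights

rng = np.random.default_rng(0)
template_feat = rng.standard_normal((64, 32))   # 64 template points, 32-dim features (placeholder)
search_feat = rng.standard_normal((128, 32))    # 128 search points, 32-dim features (placeholder)

fused, weights = fuse(search_feat, template_feat)
print(fused.shape, weights.shape)
```

Each row of `weights` indicates how strongly one search-region point attends to each template point. A real tracker would replace the random inputs with learned point features and add learned query, key, and value projections.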
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Infrared Thermography in Medicine
