FutrTrack: A Camera-LiDAR Fusion Transformer for 3D Multiple Object Tracking
Martha Teiko Teye, Ori Maoz, Matthias Rottmann

TL;DR
FutrTrack introduces a transformer-based, multimodal camera-LiDAR framework for 3D multi-object tracking that enhances robustness and accuracy without requiring explicit motion models or extensive pretraining.
Contribution
The paper presents a novel multimodal transformer-based tracking framework that refines trajectories and improves re-identification in 3D MOT tasks.
Findings
Achieves 74.7 aMOTA on the nuScenes test set.
Reduces identity switches compared to previous methods.
Demonstrates significant benefits of multimodal fusion in transformer tracking.
Abstract
We propose FutrTrack, a modular camera-LiDAR multi-object tracking framework that builds on existing 3D detectors by introducing a transformer-based smoother and a fusion-driven tracker. Inspired by query-based tracking frameworks, FutrTrack employs a multimodal two-stage transformer refinement and tracking pipeline. Our fusion tracker integrates bounding boxes with multimodal bird's-eye-view (BEV) fusion features from multiple cameras and LiDAR without the need for an explicit motion model. The tracker assigns and propagates identities across frames, leveraging both geometric and semantic cues for robust re-identification under occlusion and viewpoint changes. Prior to tracking, we apply a temporal smoother over a moving window to sequences of bounding boxes to refine trajectories, reduce jitter, and improve spatial consistency. Evaluated on nuScenes and KITTI, FutrTrack…
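To make the moving-window idea in the abstract concrete, here is a minimal sketch of smoothing one track's box centers before tracking. It is not the paper's method: FutrTrack's smoother is a transformer, whereas this stand-in uses a plain centered moving average. The function name `smooth_track`, the `(x, y, z)` center-only box representation, and the default window size are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical box representation: only the (x, y, z) center is smoothed here.
Center = Tuple[float, float, float]


def smooth_track(centers: Sequence[Center], window: int = 3) -> List[Center]:
    """Centered moving average over a track's box centers.

    Illustrative stand-in for a learned temporal smoother: each frame's
    center is replaced by the mean of the centers inside the window.
    The window is clipped at the start and end of the sequence.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    half = window // 2
    smoothed: List[Center] = []
    for i in range(len(centers)):
        lo = max(0, i - half)
        hi = min(len(centers), i + half + 1)
        chunk = centers[lo:hi]
        n = len(chunk)
        # Average each coordinate independently across the window.
        smoothed.append(
            (
                sum(c[0] for c in chunk) / n,
                sum(c[1] for c in chunk) / n,
                sum(c[2] for c in chunk) / n,
            )
        )
    return smoothed


# A jittery track: the third detection jumps ahead of its neighbours.
track = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
smoothed_track = smooth_track(track, window=3)
```

A fixed average only damps jitter in position. A learned smoother, as described in the abstract, can in principle also refine box size and orientation and adapt to object motion, which this sketch does not attempt.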
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Human Pose and Action Recognition
