SAM4D: Segment Anything in Camera and LiDAR Streams

Jianyun Xu; Song Wang; Ziqian Ni; Chunyong Hu; Sheng Yang; Jianke Zhu; Qiang Li

arXiv:2506.21547·cs.CV·June 27, 2025



TL;DR

SAM4D is a multi-modal, temporal foundation model for promptable segmentation across camera and LiDAR streams. It aligns the two modalities in a shared 3D space and uses ego-motion-compensated memory attention to keep segmentation consistent in changing autonomous driving scenes.

Contribution

The paper presents SAM4D, a multi-modal, temporal segmentation model built on Unified Multi-modal Positional Encoding (UMPE) and Motion-aware Cross-modal Memory Attention (MCMA), plus an automated multi-modal data engine that generates camera-LiDAR aligned pseudo-labels.

Findings

01 Demonstrates strong cross-modal segmentation performance on Waymo-4DSeg

02 Generates camera-LiDAR aligned pseudo-labels orders of magnitude faster than human annotation

03 Enhances temporal consistency in dynamic scenes through ego-motion compensation

Abstract

We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving…
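The abstract does not spell out how MCMA applies ego-motion compensation. As a minimal sketch of the general idea only, assuming each frame comes with a 4x4 ego-to-world pose (the names `pose_past`, `pose_now`, and `compensate_ego_motion` are hypothetical and not taken from the paper), 3D positions stored from a past frame can be re-expressed in the current ego frame before they are compared with current features:

```python
def invert_rigid(T):
    """Invert a 4x4 rigid transform [R | t; 0 0 0 1] as [R^T | -R^T t]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]  # transposed rotation
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A, B):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def compensate_ego_motion(points_past, pose_past, pose_now):
    """Map 3D points from a past ego frame into the current ego frame.

    pose_past and pose_now are assumed ego-to-world transforms (hypothetical
    inputs); the relative transform is inverse(pose_now) @ pose_past.
    """
    T = matmul4(invert_rigid(pose_now), pose_past)
    return [
        tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3))
        for x, y, z in points_past
    ]


# Usage: the vehicle drives 2 m forward along x between the two frames.
identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
moved = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
aligned = compensate_ego_motion([(5.0, 0.0, 0.0)], identity, moved)
```

In this usage example a static point that was 5 m ahead of the vehicle ends up 3 m ahead after the vehicle advances 2 m, so positions remembered from earlier frames stay aligned with the current view of the same static structure.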


Taxonomy

Topics: Remote Sensing and LiDAR Applications · Video Surveillance and Tracking Methods · 3D Surveying and Cultural Heritage

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ALIGN