FusionSAM: Visual Multi-Modal Learning with Segment Anything
Daixun Li, Weiying Xie, Mingxiang Cao, Yunke Wang, Yusi Zhang, Leyuan Fang, Yunsong Li, Chang Xu

TL;DR
FusionSAM introduces a novel prompt-based multimodal segmentation framework that leverages the Segment Anything Model for improved dense element segmentation in autonomous driving scenes.
Contribution
FusionSAM is the first to incorporate SAM into multimodal segmentation, using latent space token generation and fusion mask prompting for controllable, high-precision segmentation.
Findings
Outperforms SAM and SAM2 on multimodal autonomous driving datasets.
Achieves 4.1% higher segmentation mIoU than previous state-of-the-art methods.
Enhances segmentation performance across various multi-modal visual scenes.
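For readers unfamiliar with the metric behind these findings, mean Intersection over Union (mIoU) averages the per-class overlap between predicted and ground-truth label maps. The sketch below is a minimal pure-Python illustration, not the paper's evaluation code. The function name `mean_iou`, the flat label-list input, and the choice to skip classes absent from both maps are all assumptions made for illustration. This summary also does not say whether the reported 4.1% gain is in absolute points or relative terms.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in pred or target.

    pred and target are flat sequences of integer class labels,
    one per pixel (a hypothetical input format for this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either map assigns class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Benchmark implementations usually accumulate a confusion matrix over the whole dataset rather than averaging per image, which can give different numbers.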
Abstract
Multimodal image fusion and semantic segmentation are critical for autonomous driving. Despite advancements, current models often struggle with segmenting densely packed elements due to a lack of comprehensive fusion features for guidance during training. While the Segment Anything Model (SAM) allows precise control during fine-tuning through its flexible prompting encoder, its potential remains largely unexplored in the context of multimodal segmentation for natural images. In this paper, we introduce SAM into multimodal image segmentation for the first time, proposing a novel framework that combines Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) modules. This approach transforms the training methodology for multimodal segmentation from a traditional black-box approach to a controllable, prompt-based mechanism. Specifically, we obtain latent space features for…
Taxonomy
Topics: Anomaly Detection Techniques and Applications
Methods: Softmax · Attention Is All You Need · Segment Anything Model · Focus
