FusionSAM: Visual Multi-Modal Learning with Segment Anything

Daixun Li; Weiying Xie; Mingxiang Cao; Yunke Wang; Yusi Zhang; Leyuan Fang; Yunsong Li; Chang Xu

arXiv:2408.13980 · cs.CV · June 25, 2025 · 3 cites

Open Access

TL;DR

FusionSAM introduces a novel prompt-based multimodal segmentation framework that leverages the Segment Anything Model for improved dense element segmentation in autonomous driving scenes.

Contribution

FusionSAM is the first framework to incorporate SAM into multimodal image segmentation, using Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) for controllable, high-precision segmentation.

Findings

01. Outperforms SAM and SAM2 on multimodal autonomous driving datasets.

02. Achieves 4.1% higher segmentation mIoU than previous state-of-the-art methods.

03. Enhances segmentation performance across various multi-modal visual scenes.
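
The findings above are reported in mIoU (mean Intersection over Union), which this page does not define. As a minimal sketch of the standard definition, and not the authors' evaluation code, the snippet below computes per-class IoU over flattened label maps and averages it. The function name `mean_iou`, the toy labels, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in either the prediction or the ground truth.

    pred, target: equal-length sequences of integer class labels (flattened masks).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel counts toward both intersection and union of its class.
            intersection[p] += 1
            union[p] += 1
        else:
            # Mismatch: the pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Assumption: classes that never occur are left out of the average.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical toy example: a 2x3 label map, flattened, with 3 classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
print(round(score, 3))
```

Whether a reported gain such as the one above is an absolute difference in percentage points or a relative improvement depends on the paper's evaluation protocol, which is not stated on this page.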

Abstract

Multimodal image fusion and semantic segmentation are critical for autonomous driving. Despite advancements, current models often struggle with segmenting densely packed elements due to a lack of comprehensive fusion features for guidance during training. While the Segment Anything Model (SAM) allows precise control during fine-tuning through its flexible prompting encoder, its potential remains largely unexplored in the context of multimodal segmentation for natural images. In this paper, we introduce SAM into multimodal image segmentation for the first time, proposing a novel framework that combines Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) modules. This approach transforms the training methodology for multimodal segmentation from a traditional black-box approach to a controllable, prompt-based mechanism. Specifically, we obtain latent space features for…

Taxonomy

Topics: Anomaly Detection Techniques and Applications

Methods: Softmax · Attention Is All You Need · Segment Anything Model · Focus