StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation

Bingyu Li; Da Zhang; Zhiyuan Zhao; Junyu Gao; Xuelong Li

arXiv:2408.01343 · cs.CV · August 8, 2025 · 2 cites

TL;DR

StitchFusion introduces a flexible multimodal fusion framework that leverages pre-trained models and a novel MultiAdapter module to enhance semantic segmentation across various visual modalities with minimal additional parameters.

Contribution

The paper presents StitchFusion, a simple yet effective framework that enables multi-modal and multi-scale feature fusion during encoding using shared pre-trained models and a new MultiAdapter for cross-modal information transfer.

Findings

1. Achieves state-of-the-art results on four multi-modal segmentation datasets.
2. Demonstrates the effectiveness of MultiAdapter in enhancing cross-modal feature exchange.
3. Shows that combining MultiAdapter with existing Feature Fusion Modules is complementary.

Abstract

Multimodal semantic segmentation shows significant potential for enhancing segmentation accuracy in complex scenes. However, current methods often incorporate specialized feature fusion modules tailored to specific modalities, thereby restricting input flexibility and increasing the number of training parameters. To address these challenges, we propose StitchFusion, a straightforward yet effective modal fusion framework that integrates large-scale pre-trained models directly as encoders and feature fusers. This approach facilitates comprehensive multi-modal and multi-scale feature fusion, accommodating any visual modal inputs. Specifically, our framework achieves modal integration during encoding by sharing multi-modal visual information. To enhance information exchange across modalities, we introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information…
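
The abstract is cut off before it describes the MultiAdapter in detail, so the following is only a rough sketch of the general idea it names: lightweight adapters passing information between modality streams that come out of a shared encoder block. Everything specific here is an assumption for illustration and not the authors' implementation: the bottleneck design (down-projection, ReLU, up-projection), the zero initialisation, the residual sum, the class and function names, and the modality names "rgb", "depth", and "lidar".

```python
import numpy as np


class BottleneckAdapter:
    """Hypothetical down-project / ReLU / up-project adapter (an assumption, not the paper's module)."""

    def __init__(self, dim, hidden, rng):
        self.w_down = rng.normal(0.0, 0.02, size=(dim, hidden))
        # Zero-initialised up-projection: the adapter starts as a no-op.
        self.w_up = np.zeros((hidden, dim))

    def __call__(self, x):
        return np.maximum(x @ self.w_down, 0.0) @ self.w_up


def stitch_modalities(features, adapters):
    """Add to each modality the adapted features of every other modality.

    features: dict name -> array of shape (tokens, dim), e.g. outputs of a shared encoder block.
    adapters: dict (src, dst) -> BottleneckAdapter, one per ordered pair of modalities.
    """
    fused = {}
    for dst, x_dst in features.items():
        update = np.zeros_like(x_dst)
        for src, x_src in features.items():
            if src != dst:
                update += adapters[(src, dst)](x_src)
        fused[dst] = x_dst + update  # residual connection keeps the original stream
    return fused


rng = np.random.default_rng(0)
names = ["rgb", "depth", "lidar"]  # illustrative modality names
dim, hidden, tokens = 8, 2, 4
features = {n: rng.normal(size=(tokens, dim)) for n in names}
adapters = {(s, d): BottleneckAdapter(dim, hidden, rng)
            for s in names for d in names if s != d}
fused = stitch_modalities(features, adapters)
```

In this sketch each adapter holds only 2 × dim × hidden weights, which is one way the "minimal additional parameters" claim could be realised. Adding a modality only adds adapters and leaves the shared encoder unchanged.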

Code & Models

Repositories

libingyu01/stitchfusion-stitchfusion-weaving-any-visual-modalities-to-enhance-multimodal-semantic-segmentation
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Hand Gesture Recognition Systems · Tactile and Sensory Interactions

Methods: Adapter