EGFormer: Towards Efficient and Generalizable Multimodal Semantic Segmentation
Zelin Zhang, Tao Zhang, Kedi Li, Xu Zheng

TL;DR
EGFormer is an efficient multimodal semantic segmentation framework that dynamically prioritizes and filters modalities, reducing computational costs significantly while maintaining high performance and demonstrating strong generalization in transfer tasks.
Contribution
The paper introduces EGFormer, a novel framework with modules for dynamic modality importance scoring and dropping, enabling efficient and generalizable multimodal segmentation.
Findings
Achieves up to 88% reduction in parameters
Reduces GFLOPs by 50% while maintaining accuracy
Outperforms existing methods in transfer learning tasks
Abstract
Recent efforts have explored multimodal semantic segmentation using various backbone architectures. However, while most methods aim to improve accuracy, their computational efficiency remains underexplored. To address this, we propose EGFormer, an efficient multimodal semantic segmentation framework that flexibly integrates an arbitrary number of modalities while significantly reducing model parameters and inference time without sacrificing performance. Our framework introduces two novel modules. First, the Any-modal Scoring Module (ASM) assigns importance scores to each modality independently, enabling dynamic ranking based on their feature maps. Second, the Modal Dropping Module (MDM) filters out less informative modalities at each stage, selectively preserving and aggregating only the most valuable features. This design allows the model to leverage useful information from all…
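The score-then-drop idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (mean absolute activation), the score-weighted averaging, the flattened list representation of a feature map, and all function names below are assumptions made purely for illustration. The sketch also covers a single stage only, whereas the abstract describes filtering at each stage.

```python
from typing import Dict, List, Tuple

# Assumption: a feature map is represented as a flat list of floats.
FeatureMap = List[float]


def score_modality(features: FeatureMap) -> float:
    """Hypothetical importance score: mean absolute activation."""
    return sum(abs(v) for v in features) / len(features)


def rank_modalities(feats: Dict[str, FeatureMap]) -> List[Tuple[str, float]]:
    """Score each modality independently, then sort from highest to lowest score."""
    scored = [(name, score_modality(f)) for name, f in feats.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def drop_and_aggregate(
    feats: Dict[str, FeatureMap], keep: int
) -> Tuple[List[str], FeatureMap]:
    """Keep the `keep` highest-scoring modalities and fuse them.

    Fusion here is a score-weighted average, which is an assumed stand-in
    for whatever aggregation the paper actually uses.
    """
    ranked = rank_modalities(feats)[:keep]
    total = sum(score for _, score in ranked)
    length = len(next(iter(feats.values())))
    fused = [0.0] * length
    for name, score in ranked:
        # Fall back to uniform weights if every kept score is zero.
        weight = score / total if total > 0 else 1.0 / len(ranked)
        for i, value in enumerate(feats[name]):
            fused[i] += weight * value
    return [name for name, _ in ranked], fused


# Toy inputs; the modality names and values are made up for the example.
example_feats = {
    "rgb": [1.0, -1.0, 2.0, 2.0],
    "depth": [0.5, 0.5, 0.5, 0.5],
    "lidar": [0.0, 0.1, 0.0, 0.1],
}
kept, fused = drop_and_aggregate(example_feats, keep=2)
```

With these toy inputs the lowest-scoring modality ("lidar") is dropped and the remaining two are blended in proportion to their scores. A learned scoring network and a per-stage drop schedule would replace the fixed rule and the fixed `keep` value in a real model.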
Taxonomy
Topics: Natural Language Processing Techniques
