Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation

Xu Zheng; Haiwei Xue; Jialei Chen; Yibo Yan; Lutao Jiang; Yuanhuiyi Lyu; Kailun Yang; Linfeng Zhang; Xuming Hu

arXiv:2411.17141 · cs.CV · May 19, 2025

TL;DR

This paper introduces a novel framework for training robust multimodal segmentors that effectively handle any combination of visual modalities by using cross-modal and unimodal distillation techniques, improving performance and reducing modality bias.

Contribution

The paper presents the first comprehensive framework for unimodal and cross-modal distillation in multimodal segmentation, enhancing robustness across various modality combinations.

Findings

1. Achieves superior performance on synthetic and real-world benchmarks.
2. Effectively reduces unimodal bias and over-reliance on specific modalities.
3. Demonstrates robustness in handling missing or incomplete modalities.

Abstract

Simultaneously using multimodal inputs from multiple sensors to train segmentors is intuitively advantageous but practically challenging. A key challenge is unimodal bias, where multimodal segmentors over-rely on certain modalities, causing performance drops when others are missing, which is common in real-world applications. To this end, we develop the first framework for learning a robust segmentor that can handle any combination of visual modalities. Specifically, we first introduce a parallel multimodal learning strategy for learning a strong teacher. The cross-modal and unimodal distillation is then achieved in the multi-scale representation space by transferring the feature-level knowledge from multimodal to anymodal segmentors, aiming at addressing the unimodal bias and avoiding over-reliance on specific modalities. Moreover, a prediction-level modality-agnostic semantic distillation is…
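The feature-level distillation described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the modality names (`rgb`, `depth`, `lidar`), the helper functions, the use of mean-squared error, and the exact form of the two loss terms are all assumptions made for illustration. The only detail taken from this page is that the teacher averages multimodal features, as noted in the Reviewer 01 comments.

```python
import random

def mean_fuse(features):
    """Element-wise mean of equally sized feature vectors."""
    n = len(features)
    return [sum(vals) / n for vals in zip(*features)]

def mse(a, b):
    """Mean-squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def sample_modalities(modalities, rng):
    """Hypothetical helper: pick a random non-empty subset of modalities."""
    k = rng.randint(1, len(modalities))
    return rng.sample(modalities, k)

def distillation_losses(teacher_feats, student_feats, available):
    """Illustrative loss terms (assumed forms, not taken from the paper).

    teacher_feats, student_feats: dict mapping modality name -> feature vector
    available: modalities the student actually receives
    """
    # Teacher sees every modality and fuses them by averaging.
    teacher_fused = mean_fuse(list(teacher_feats.values()))
    # Student fuses only the modalities that are present.
    student_fused = mean_fuse([student_feats[m] for m in available])
    # Assumed cross-modal term: student fusion should match the full teacher fusion.
    cross_modal = mse(student_fused, teacher_fused)
    # Assumed unimodal term: each available modality matches its teacher counterpart.
    unimodal = sum(mse(student_feats[m], teacher_feats[m]) for m in available)
    unimodal /= len(available)
    return cross_modal, unimodal

teacher = {"rgb": [1.0, 2.0], "depth": [3.0, 4.0], "lidar": [5.0, 6.0]}
student = {m: list(v) for m, v in teacher.items()}  # student copies teacher here
subset = sample_modalities(list(teacher), random.Random(0))
print(subset, distillation_losses(teacher, student, subset))
```

In this toy setup the cross-modal term is nonzero whenever the student is missing modalities, even though each per-modality feature matches the teacher exactly, which is the kind of gap a distillation objective would push the student to close.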

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Tackles a practically important and underexplored problem: robustness to missing modalities in multimodal segmentation, relevant to autonomous driving, robotics, and remote sensing.
- Clear and well-justified motivation; the proposed method is well-designed and directly addresses the stated problem.
- Empirical results demonstrate substantial improvements, validating the effectiveness of the proposed algorithm.
- Well-written and easy to follow.

Weaknesses

- Why does the teacher model compute the mean of the multimodal features? The importance and semantic contribution of each modality to the segmentation task may vary.
- Why does $L_{umd}$ introduce a performance loss in Table 3?

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The motivation is clear and convincing. The experiment is detailed, and the performance gain of PML is significant.

Weaknesses

Cross-modal distillation is widely used in multimodal segmentation with missing modalities [1][2][3]. What is the new idea proposed in PML? Besides, imbalanced learning in multimodal learning with missing modalities has also been studied by existing methods [1]. That work introduces extra unimodal regularizers. The authors should compare the proposed unimodal distillation with it. Comparison with recent SOTA methods, such as works [3][4]. The method includes three hyperparameters and performs in

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. The paper addresses the underexplored and critical issue of unimodal bias in multimodal semantic segmentation, with a precise focus on robustness to missing modalities, a typical and practically important scenario for multi-sensor systems.
2. The paper is easy to follow.

Weaknesses

1. Evaluation is limited to the SegFormer-B0 backbone. Including results with larger backbones (e.g., B3 or B5) or other architectures would help demonstrate scalability.
2. The current ablations focus mainly on hyperparameters. Analyzing robustness to noise or corruption, not just missing modalities, would better support claims of generalization.
3. The teacher–student setup effectively doubles training cost. The discussion of the trade-off between efficiency and performance is brief and coul

Reviewer 04 · Rating 2 · Confidence 4

Strengths

- The proposed method improves model performance in various missing-modality scenarios compared to existing approaches.
- The ablation study on the sensitivity of hyperparameters is comprehensive.

Weaknesses

- The proposed method lacks sufficient novelty. Both unimodal and cross-modal knowledge distillation have been extensively explored in previous works on missing-modality learning, e.g., [R1–R5]. The paper does not adequately review this substantial body of literature nor clearly delineate how the proposed approach introduces new insights or technical advances beyond existing methods. Furthermore, the experimental evaluation does not include comparisons against these representative baselines, mak

Code & Models

Repositories

zhengxuJosh/AnySeg (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques