Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation
Xu Zheng, Haiwei Xue, Jialei Chen, Yibo Yan, Lutao Jiang, Yuanhuiyi Lyu, Kailun Yang, Linfeng Zhang, Xuming Hu

TL;DR
This paper introduces a novel framework for training robust multimodal segmentors that effectively handle any combination of visual modalities by using cross-modal and unimodal distillation techniques, improving performance and reducing modality bias.
Contribution
The paper presents the first comprehensive framework for unimodal and cross-modal distillation in multimodal segmentation, enhancing robustness across various modality combinations.
Findings
Achieves superior performance on synthetic and real-world benchmarks.
Effectively reduces unimodal bias and over-reliance on specific modalities.
Demonstrates robustness in handling missing or incomplete modalities.
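The "any combination of modalities" setting behind these findings can be made concrete by enumerating every non-empty subset of the available sensors. The sketch below is illustrative only: the modality names and the helper `modality_combinations` are assumptions, since this page does not list the modalities or the evaluation protocol the authors used.

```python
from itertools import combinations


def modality_combinations(modalities):
    """Return every non-empty subset of `modalities`, smallest subsets first."""
    subsets = []
    for size in range(1, len(modalities) + 1):
        subsets.extend(combinations(modalities, size))
    return subsets


# Hypothetical modality names, chosen for illustration only.
MODALITIES = ("rgb", "depth", "event", "lidar")
COMBOS = modality_combinations(MODALITIES)

for combo in COMBOS:
    print("+".join(combo))
```

With M modalities there are 2^M - 1 such settings (15 for the four assumed here), which is why a segmentor that depends on one dominant modality can look strong on the full input yet degrade on many of the subsets.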
Abstract
Simultaneously using multimodal inputs from multiple sensors to train segmentors is intuitively advantageous but practically challenging. A key challenge is unimodal bias, where multimodal segmentors over-rely on certain modalities, causing performance drops when others are missing, a situation common in real-world applications. To this end, we develop the first framework for learning a robust segmentor that can handle any combination of visual modalities. Specifically, we first introduce a parallel multimodal learning strategy for learning a strong teacher. The cross-modal and unimodal distillation is then achieved in the multi-scale representation space by transferring the feature-level knowledge from multimodal to anymodal segmentors, aiming at addressing the unimodal bias and avoiding over-reliance on specific modalities. Moreover, a prediction-level modality-agnostic semantic distillation is…
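The feature-level distillation described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the teacher target is taken to be the mean of all modality features at each scale (one of the reviews below mentions this averaging), the student sees only a subset of modalities, and both loss terms use a plain mean squared error. The function names, the MSE choice, and the exact form of the unimodal term are all assumptions.

```python
def mean_fuse(features):
    """Element-wise mean of equally sized feature vectors (one per modality)."""
    count = len(features)
    return [sum(values) / count for values in zip(*features)]


def mse(a, b):
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_losses(teacher_feats, student_feats, kept):
    """Return (cross_modal_loss, unimodal_loss), each averaged over scales.

    teacher_feats / student_feats: dict mapping modality name to a list of
    per-scale feature vectors. kept: modalities available to the student.
    """
    modalities = list(teacher_feats)
    num_scales = len(teacher_feats[modalities[0]])
    cross, uni = 0.0, 0.0
    for s in range(num_scales):
        # Assumed teacher target: mean over ALL modalities at this scale.
        target = mean_fuse([teacher_feats[m][s] for m in modalities])
        # Student fuses only the modalities it actually receives.
        fused = mean_fuse([student_feats[m][s] for m in kept])
        cross += mse(fused, target)
        # Assumed unimodal term: match each kept modality to its teacher feature.
        uni += sum(mse(student_feats[m][s], teacher_feats[m][s]) for m in kept) / len(kept)
    return cross / num_scales, uni / num_scales


# Toy single-scale example with two hypothetical modalities.
teacher = {"rgb": [[1.0, 3.0]], "depth": [[3.0, 5.0]]}
student = {"rgb": [[1.0, 3.0]], "depth": [[3.0, 5.0]]}
print(distillation_losses(teacher, student, kept=["rgb"]))           # (1.0, 0.0)
print(distillation_losses(teacher, student, kept=["rgb", "depth"]))  # (0.0, 0.0)
```

In the toy example the RGB-only student matches its own teacher feature exactly, so the unimodal term is zero, while the cross-modal term stays positive because the fused teacher target also encodes depth. Pulling the partial-input student toward that fused target is the intuition behind transferring knowledge from a multimodal teacher to an anymodal student.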
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Tackles a practically important and underexplored problem: robustness to missing modalities in multimodal segmentation, relevant to autonomous driving, robotics, and remote sensing.
- Clear and well-justified motivation; the proposed method is well-designed and directly addresses the stated problem.
- Empirical results demonstrate substantial improvements, validating the effectiveness of the proposed algorithm.
- Well-written and easy to follow.
- Why does the teacher model compute the mean of the multimodal features? The importance and semantic contribution of each modality to the segmentation task might vary.
- Why does $L_{umd}$ introduce a performance loss in Table 3?
The motivation is clear and convincing. The experiment is detailed, and the performance gain of PML is significant.
Cross-modal distillation is widely used in multimodal segmentation with missing modalities [1][2][3]. What is the new idea proposed in PML? Besides, imbalanced learning in multimodal learning with missing modalities has also been studied by existing methods [1], which introduce extra unimodal regularizers. The authors should compare the proposed unimodal distillation with it. Comparison with recent SOTA methods, such as [3][4]. The method includes three hyperparameters and performs in
1. The paper addresses the underexplored and critical issue of unimodal bias in multimodal semantic segmentation, with a precise focus on robustness to missing modalities, a typical and practically important scenario for multi-sensor systems.
2. The paper is easy to follow.
1. Evaluation is limited to the SegFormer-B0 backbone. Including results with larger backbones (e.g., B3 or B5) or other architectures would help demonstrate scalability.
2. The current ablations focus mainly on hyperparameters. Analyzing robustness to noise or corruption, not just missing modalities, would better support claims of generalization.
3. The teacher-student setup effectively doubles training cost. The discussion of the trade-off between efficiency and performance is brief and coul
- The proposed method improves model performance in various missing-modality scenarios compared to existing approaches.
- The ablation study on the sensitivity of hyperparameters is comprehensive.
- The proposed method lacks sufficient novelty. Both unimodal and cross-modal knowledge distillation have been extensively explored in previous works on missing-modality learning, e.g., [R1–R5]. The paper does not adequately review this substantial body of literature nor clearly delineate how the proposed approach introduces new insights or technical advances beyond existing methods. Furthermore, the experimental evaluation does not include comparisons against these representative baselines, mak
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
