SiMO: Single-Modality-Operable Multimodal Collaborative Perception
Jiageng Wen, Shengjie Zhao, Bing Li, Jiafeng Huang, Kenan Ye, Hao Deng

TL;DR
SiMO is a collaborative perception framework that keeps working when a sensor fails: it adaptively fuses whichever modality features remain and keeps the modalities independent of one another, so performance holds up in single-modality operation.
Contribution
Introduces SiMO, a multimodal perception method that handles modal failures and preserves modality-specific features through adaptive fusion and a new training strategy.
Findings
Effective alignment of multimodal features.
Maintains performance during modal failures.
Preserves modality-specific features.
Abstract
Collaborative perception integrates multi-agent perspectives to enhance the sensing range and overcome occlusion issues. While existing multimodal approaches leverage complementary sensors to improve performance, they are highly prone to failure, especially when a key sensor like LiDAR is unavailable. The root cause is that feature fusion leads to semantic mismatches between single-modality features and the downstream modules. This paper addresses this challenge for the first time in the field of collaborative perception, introducing Single-Modality-Operable Multimodal Collaborative Perception (SiMO). By adopting the proposed Length-Adaptive Multi-Modal Fusion (LAMMA), SiMO can adaptively handle remaining modal features during modal failures while maintaining consistency of the semantic space. Additionally, leveraging the innovative "Pretrain-Align-Fuse-RD" training strategy, SiMO…
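To make the idea of fusing a variable number of modality features concrete, here is a minimal sketch. It is not LAMMA: the function `fuse_available`, the `query` vector, and the toy feature values are all hypothetical, and softmax-weighted pooling is simply one generic way to produce a fixed-size output from however many inputs are present.

```python
import math

def fuse_available(features, query):
    """Softmax-weighted pooling over whichever modality features are present.

    features: dict mapping a modality name to a feature vector,
              or to None if that sensor has failed (hypothetical convention).
    query:    hypothetical vector used to score each available modality.
    """
    present = [v for v in features.values() if v is not None]
    if not present:
        raise ValueError("at least one modality must be available")

    # Scaled dot-product score for each available modality.
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, v)) / scale for v in present]

    # Numerically stable softmax over the available modalities only.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The weighted sum has the same dimension for one input or many.
    return [sum(w * v[i] for w, v in zip(weights, present))
            for i in range(len(query))]

# Toy features, invented for illustration.
lidar = [1.0, 0.0, 2.0]
camera = [0.5, 1.0, 0.0]
query = [1.0, 1.0, 1.0]

both = fuse_available({"lidar": lidar, "camera": camera}, query)
camera_only = fuse_available({"lidar": None, "camera": camera}, query)
```

With a single surviving modality the softmax weight is 1, so `camera_only` equals the camera feature unchanged and the output keeps the same shape as in the two-modality case.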
Peer Reviews
Decision: ICLR 2026 Poster
- The paper tackles robustness to sensor failure in collaborative perception, which is a realistic and safety-critical issue that is rarely explored in existing multimodal collaborative perception literature.
- The method is plug-and-play and demonstrated on two backbone frameworks (AttFusion and Pyramid Fusion), suggesting broad applicability.
- Strong empirical validation: the paper presents comprehensive experiments across homogeneous/heterogeneous modality failures.
- While RD improves robustness as a data augmentation strategy, the 0.5 dropout probability is not justified theoretically or empirically. A sensitivity analysis would strengthen this design choice.
- While LAMMA fusion is designed to handle missing modalities, how does it differ from simpler alternatives, such as replacing the missing modality’s feature with a zero tensor in existing fusion schemes?
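The two points above can be illustrated with a toy sketch. It assumes RD means dropping each modality independently with the quoted probability of 0.5, and it uses a plain element-wise mean as a stand-in fusion step. The helpers `random_drop` and `mean_fuse` and the feature values are hypothetical, and none of this reproduces the paper's training code or LAMMA.

```python
import random

def random_drop(features, p=0.5, rng=random):
    """Drop each modality with probability p, always keeping at least one."""
    kept = {name: v for name, v in features.items() if rng.random() >= p}
    if not kept:
        # Fall back to one randomly chosen modality so the sample stays usable.
        name = rng.choice(sorted(features))
        kept = {name: features[name]}
    return kept

def mean_fuse(vectors):
    """Element-wise mean of equally sized vectors (stand-in fusion step)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Toy features, invented for illustration.
lidar = [1.0, 0.0, 2.0]
camera = [0.5, 1.0, 0.0]
zeros = [0.0, 0.0, 0.0]

# Zero-filling keeps a fixed two-slot input, so the camera signal is halved.
zero_filled = mean_fuse([zeros, camera])

# Fusing only what is present leaves the camera feature untouched.
present_only = mean_fuse([camera])

# Simulate the augmentation with a seeded generator for repeatability.
rng = random.Random(0)
samples = [random_drop({"lidar": lidar, "camera": camera}, p=0.5, rng=rng)
           for _ in range(1000)]
```

In this toy setting the zero-filled result is `[0.25, 0.5, 0.0]` while the present-only result is `[0.5, 1.0, 0.0]`, which shows one way a zero tensor can shift the fused feature away from what a single-modality branch produces. Whether the same effect explains the gap the reviewer asks about is not established here.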
The paper is, to the best of my knowledge, the first work in collaborative perception that explicitly tackles multimodal perception failure caused by missing modalities, with a particular focus on ensuring operability when only RGB inputs are available. It identifies the inconsistency between pre-fusion and post-fusion features as the main reason for performance collapse during modality failure, and introduces LAMMA, a length-adaptive fusion module designed to maintain semantic consistency unde
The real-world validation is limited, and the camera results are inconsistent. On DAIR-V2X, the camera-only performance remains poor (the authors attribute this to the limitations of single-view LSS). This weakens SiMO’s claim of addressing single-modality operability in practice. The paper should (a) include stronger single-view camera baselines, or (b) tone down the claims regarding the single-view camera setting. Comparisons with stronger or more recent modality-failure methods adapted to mu
1. This work is the first to address the problem of modality failure in the context of multi-agent collaborative perception, and it proposes an effective and practical solution through the SiMO framework.
2. The paper is supported by extensive experiments, complemented by insightful visualizations and in-depth data analyses.
3. The manuscript is clearly written and logically structured and the technical content is easy to follow.
1. The paper does not clearly explain the unique challenges of modality failure in the multi-agent collaborative perception (MACP) setting compared to the single-agent case. Similar modality failures could also appear in single-agent multi-modality scenarios. The authors should clearly explain why existing single-agent modality-robust methods are inadequate in the MACP context, and ideally include comparative experiments to substantiate this claim.
2. The training procedure is complex. It requi
Taxonomy
Topics: Robotics and Sensor-Based Localization · Speech and dialogue systems · Underwater Vehicles and Communication Systems
