MoRA: Missing Modality Low-Rank Adaptation for Visual Recognition
Shu Zhao, Nilesh Ahuja, Tan Yu, Tianyi Shen, Vijaykrishnan Narayanan

TL;DR
MoRA is a parameter-efficient method that improves visual recognition with missing modalities by modeling cross-modal interactions and enabling knowledge transfer, achieving significant performance gains with minimal additional training and inference costs.
Contribution
MoRA introduces a novel low-rank adaptation approach that explicitly models cross-modal interactions and maintains modality-specific features for better missing-modality recognition.
Findings
Achieves a 5.24% performance improvement in missing-modality scenarios.
Uses only 25.90% of the inference time of the SOTA method.
Requires just 0.11% of the trainable parameters of full fine-tuning.
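To see why a low-rank method trains so few parameters, the arithmetic for a single adapted layer can be sketched as below. This is standard LoRA counting, not the paper's own accounting: the layer width and rank are hypothetical values chosen for illustration, and the model-wide fraction reported above also depends on which layers are adapted.

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Fraction of a dense (d_out x d_in) weight's parameters trained by a
    rank-`rank` update B @ A, with B: (d_out x rank) and A: (rank x d_in)."""
    full_params = d_in * d_out
    low_rank_params = rank * (d_in + d_out)
    return low_rank_params / full_params


# Hypothetical layer size and rank (assumptions, not taken from the paper).
fraction = lora_trainable_fraction(d_in=768, d_out=768, rank=4)
print(f"{fraction:.4%} of this layer's weights are trainable")
```

The fraction shrinks further once the frozen layers that receive no adapter are counted in the denominator.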
Abstract
Pre-trained vision language models have shown remarkable performance on visual recognition tasks, but they typically assume the availability of complete multimodal inputs during both training and inference. In real-world scenarios, however, modalities may be missing due to privacy constraints, collection difficulties, or resource limitations. While previous approaches have addressed this challenge using prompt learning techniques, they fail to capture the cross-modal relationships necessary for effective multimodal visual recognition and suffer from inevitable computational overhead. In this paper, we introduce MoRA, a parameter-efficient fine-tuning method that explicitly models cross-modal interactions while maintaining modality-specific adaptations. MoRA introduces modality-common parameters between text and vision encoders, enabling bidirectional knowledge transfer. Additionally,…
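One way to picture "modality-common parameters" alongside modality-specific adaptations is sketched below. This is a hypothetical parameterisation written for illustration only, not the formulation from the paper: it assumes each frozen encoder weight gets its own low-rank factors `A_*` and `B_*`, and that a small `rank x rank` core `S_common` is shared by both encoders. All names, widths, and the rank are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_text, rank = 768, 512, 4  # hypothetical widths and rank

# Frozen pre-trained weights (random stand-ins for real encoder layers).
W_vision = rng.standard_normal((d_vision, d_vision))
W_text = rng.standard_normal((d_text, d_text))

# Assumed modality-common core, shared by both encoders.
S_common = np.eye(rank)

# Modality-specific factors; B starts at zero so training begins
# from the unmodified pre-trained model (the usual LoRA initialisation).
A_vision = rng.standard_normal((rank, d_vision)) * 0.01
B_vision = np.zeros((d_vision, rank))
A_text = rng.standard_normal((rank, d_text)) * 0.01
B_text = np.zeros((d_text, rank))


def adapted_forward(W, A, B, S, x):
    """Frozen path plus low-rank update routed through the shared core S."""
    return W @ x + B @ (S @ (A @ x))


x_vision = rng.standard_normal(d_vision)
x_text = rng.standard_normal(d_text)
y_vision = adapted_forward(W_vision, A_vision, B_vision, S_common, x_vision)
y_text = adapted_forward(W_text, A_text, B_text, S_common, x_text)
```

Because the shared core lives in the rank-sized space, it can couple two encoders whose hidden widths differ; gradients from either modality would update the same `S_common`.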
Peer Reviews
Decision: ICLR 2026 Poster
* Inclusion of low-rank updates for modality-specific and shared parameters.
* Inference-time efficiency in terms of minimal computational overhead.
* Superior performance across multiple benchmarks involving diverse missing-modality scenarios.
* Strong cross-scenario generalization in cases where training-time missing-modality patterns differ significantly from the testing patterns.
* The proposed low-rank update approach is restricted to dual-encoder models, i.e. CLIP or SLIP. Its applicability should be explored for single-stream models that rely on aligning visual representations with LLM inputs.
* The current analysis is restricted to image-text datasets with significant text-modality dominance. Tasks involving other modality combinations (audio-visual inputs) could be considered, e.g. audio-visual action recognition.
* Situations
- The core mechanism of MoRA is clever. Using Gram matrices to facilitate shared, low-rank adaptation between two frozen encoders of different dimensions is a technical solution to the dimension-mismatch problem.
- The primary advantage of MoRA over its main competitors (MMP, DCP) is its efficiency. Because MoRA is a LoRA-based method, all its adapter weights can be merged into the backbone model post-training.
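The merging property mentioned in this review is a general feature of LoRA-style updates and can be checked numerically. The sketch below uses plain LoRA on one random layer with hypothetical sizes; it does not reproduce MoRA's Gram-matrix construction, only the reason a merged adapter adds no inference cost.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 48, 4  # hypothetical layer shape and rank

W = rng.standard_normal((d_out, d_in))  # frozen backbone weight
A = rng.standard_normal((rank, d_in))   # trained down-projection
B = rng.standard_normal((d_out, rank))  # trained up-projection
x = rng.standard_normal(d_in)

# During training the adapter runs as a separate branch next to W.
y_adapter = W @ x + B @ (A @ x)

# After training, the update is folded into the weight once.
W_merged = W + B @ A
y_merged = W_merged @ x

merge_matches = np.allclose(y_adapter, y_merged)
print(merge_matches)
```

Since `W_merged` has the same shape as `W`, the merged model runs exactly as many operations per input as the original backbone.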
- The central premise of the paper, that missing modalities are a common, critical "real-world scenario", is weakly motivated. The paper justifies this by vaguely citing "privacy constraints, collection difficulties, or resource limitations" but provides no concrete, compelling examples.
- In most practical VLM applications (VQA, captioning, retrieval), the user provides the modalities. The experimental scenarios, such as classifying a "hateful meme" with the text missing, or identifying food from
+ The paper is technically sound, well-motivated, and tackles the missing modality issue, which is an important and practical challenge in multimodal learning.
+ The proposed MoRA method is conceptually interesting and carefully designed, as it explicitly models cross-modal interactions while retaining modality-specific adaptations.
+ The proposed MoRA brings impressive improvements and consistently outperforms other SOTA methods across various missing-modality scenarios.
+ Extensive ablation s
- Scalability to more modalities (e.g., beyond two) is a potential concern. While MoRA appears feasible for dual-modality tasks such as image–text or audio–text learning, its current design may not scale efficiently to scenarios involving multiple modalities (e.g., 4 or more), where cross-modal adaptations could grow exponentially in cost and complexity.
- The generalizability across multimodal architectures is not clearly demonstrated. The proposed MoRA is primarily evaluated on a CLIP-like arc
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
