SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities
Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Wei Chen, Fangxiang Feng, Xiaojie Wang

TL;DR
SMAR is a regularization method for multimodal MoE models that preserves language capabilities while enabling expert specialization across modalities, with minimal architecture changes and efficient training.
Contribution
We introduce SMAR, a novel KL divergence-based regularization technique that maintains language skills in multimodal MoE models without architectural modifications.
Findings
SMAR retains 86.6% of language ability with only 2.5% pure-text data.
Outperforms baseline methods on language retention while maintaining strong multimodal performance.
Efficiently balances modality specialization and language retention.
Abstract
Mixture of Experts (MoE) architectures have become a key approach for scaling large language models, with growing interest in extending them to multimodal tasks. Existing methods to build multimodal MoE models either incur high training costs or suffer from degraded language capabilities when adapting pretrained models. To address this, we propose Soft Modality-Aware Routing (SMAR), a novel regularization technique that uses Kullback-Leibler divergence to control routing probability distributions across modalities, encouraging expert specialization without modifying model architecture or heavily relying on textual data. Experiments on visual instruction tuning show that SMAR preserves language ability at 86.6% retention with only 2.5% pure text, outperforming baselines while maintaining strong multimodal performance. Our approach offers a practical and efficient solution to balance…
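The abstract's core idea, a KL-divergence regularizer on per-modality routing distributions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the direction of the KL term, the use of token-averaged routing probabilities, and the `low`/`high` band (suggested by the "tolerance band" a reviewer mentions) are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_routing(router_logits):
    """Average the per-token routing probabilities over a group of tokens."""
    probs = [softmax(row) for row in router_logits]
    n_experts = len(probs[0])
    return [sum(p[e] for p in probs) / len(probs) for e in range(n_experts)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over experts."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def band_penalty(text_logits, image_logits, low, high):
    """Hypothetical regularizer (assumption, not the paper's exact loss).

    Returns (penalty, divergence). The penalty is zero while the divergence
    between the mean text and mean image routing distributions stays inside
    [low, high], and grows linearly outside that band.
    """
    p_text = mean_routing(text_logits)
    p_image = mean_routing(image_logits)
    d = kl_divergence(p_text, p_image)
    penalty = max(0.0, low - d) + max(0.0, d - high)
    return penalty, d
```

In a training loop, a term like `penalty` would be added to the task loss with some weight. Identical routing for both modalities yields a divergence of zero and is penalized for falling below `low`, while fully separated routing is penalized for exceeding `high`, which is what makes the control "soft".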
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The proposal introduces a unique method, SMAR, to manage expert specialization in MoE-based multimodal models. It creatively uses KL divergence to control routing across modalities, enhancing language capability retention without requiring architectural changes. The idea of soft modality-aware routing is innovative and addresses the crucial challenge of balancing modality differentiation with language performance.
- The experimental setup is comprehensive, and the paper uses relevant and estab
- The paper uses VITA and MoE-LLaVA as the base models in the experiments. They are now considered outdated. The validity of the proposed method on these older models might not be representative of its potential on more advanced architectures.
- The paper compares SMAR with load-balancing loss, but it lacks sufficient details about how load-balancing loss is applied in the experiments. There are no descriptions of the training parameters, settings, or any explanation of why this method is being
1) The paper studies and proposes a metric to quantify the routing strategies in MoE-based multimodal models.
2) The paper proposes SMAR to control expert modality differentiation.
3) The experiment and ablation are well designed with detailed analysis to study SMAR.
1) SMAR does not generally improve multimodal capabilities, although it retains language capabilities better. Why?
2) There is a lack of comparative analysis with works mentioned in the intro and related works sections to show how this work challenges SOTA.
3) The abstract states that 2.5% pure text was used, while the conclusion says "without additional pure text"?
1. Clear and practically meaningful motivation: The paper pinpoints a real, widespread issue in LVLMs: when visual capabilities are integrated via multimodal data, the underlying language competence of the LLM is degraded. Addressing this is crucial for real-world deployment of LVLMs.
2. Diagnostic “metric” for modality-wise routing: The authors propose a novel MRD “metric” to evaluate routing probability distributions across modalities, providing a useful lens for analyzing routing strategies in Mo
1. Insufficient experimentation: Although the authors claim improvements on multiple metrics, the experimental section does not provide sufficient evidence. For example, in Table 1 across nine multimodal datasets, the proposed method is best on only four; on text datasets, it is best on only 4/8. In Table 2, across six datasets, only two are best. While the mean reportedly improves from 81.6% to 86.6%, if MBPP is excluded, the mean gain drops from 5% to 0.4%. If the improvements hold primarily o
The paper introduces a novel perspective on analyzing MoE routing behavior through MRD, which provides valuable insights into modality-specific expert specialization. The tolerance band approach for controlling expert differentiation is creative and theoretically motivated. The core methodology is clearly explained with appropriate mathematical notation. The MRD concept is well-motivated and the SMAR loss formulation is understandable. Addressing language capability degradation in multimodal tra
**Misleading Performance Claims**: The paper's main performance claims are problematic. The authors compare their MoE-based method against dense models (LLaVA-1.5) to claim improvements of 5.4%, 7.0%, 7.4%, etc., which is fundamentally unfair due to architectural differences. When compared against fair baselines with the same architecture, the multimodal improvements are marginal (typically <2%) and sometimes negative (e.g., VQAv2: 82.5→82.4).
**Questionable Baseline Quality**: The baseline mod
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Mixture of Experts
