MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts
Yuxuan Lou, Kai Yang, Yang You

TL;DR
MoST introduces a novel multimodal language model that effectively integrates speech and text using a specialized mixture of experts architecture, enhancing cross-modal understanding and outperforming existing models.
Contribution
The paper proposes the first fully open-source speech-text large language model with a modality-aware mixture of experts architecture, improving multimodal learning and performance.
Findings
MoST outperforms comparable models on multiple benchmarks.
Modality-specific routing improves cross-modal understanding.
Shared experts facilitate effective information transfer.
Abstract
We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model…
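The routing idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes top-k softmax gating, single-matrix linear experts, one always-active shared expert, and an index-based half-and-half split of experts into text and audio groups. The function name `mamoe_layer` and all parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mamoe_layer(tokens, is_audio, router_w, expert_ws, shared_w, top_k=2):
    """Hypothetical modality-aware MoE layer.

    tokens:    (n, d) token representations
    is_audio:  (n,) bool, True for speech tokens, False for text tokens
    router_w:  (d, E) router weights
    expert_ws: (E, d, d) one linear map per routed expert (assumption)
    shared_w:  (d, d) shared expert applied to every token (assumption)
    """
    n_experts = router_w.shape[1]
    half = n_experts // 2
    assert top_k <= half, "each modality group must hold at least top_k experts"

    # Assumed partition by index: experts [0, half) are text, [half, E) are audio.
    expert_is_audio = np.arange(n_experts) >= half

    logits = tokens @ router_w  # (n, E)
    # A token may only be routed to experts in its own modality group.
    allowed = expert_is_audio[None, :] == is_audio[:, None]
    masked = np.where(allowed, logits, -np.inf)

    # Content-based top-k selection inside the allowed group.
    top_idx = np.argsort(-masked, axis=-1)[:, :top_k]          # (n, k)
    top_logits = np.take_along_axis(masked, top_idx, axis=-1)  # (n, k)
    gates = softmax(top_logits, axis=-1)

    out = np.zeros_like(tokens, dtype=float)
    for slot in range(top_k):
        chosen = expert_ws[top_idx[:, slot]]                   # (n, d, d)
        expert_out = np.einsum("nd,nde->ne", tokens, chosen)
        out += gates[:, slot:slot + 1] * expert_out

    # The shared expert sees both modalities, giving a cross-modal path.
    out += tokens @ shared_w
    return out, top_idx
```

In this sketch the modality tag only restricts which experts are eligible. The choice among eligible experts still depends on token content, while the shared expert is the only set of parameters that every token passes through.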
Peer Reviews
Decision: Submitted to ICLR 2026
1. The modality-aware mixture of experts (MAMoE) provides a clean and intuitive way to train a speech language model, using modality-specific expert groups and a parallel shared expert.
2. The experimental validation is solid and rigorous, including ablations on initialization with a non-MoE LLM (Llama3.2 3B) and ablations without modality-specific experts or shared experts.
1. Reassigning 50% of the initial text expert capacity to $\mathcal{E}_{audio}$ is a major structural change, but the division is based simply on expert index, without any reliable partition mechanism. The hard 50% partition of experts risks losing valuable text knowledge.
2. The paper lacks text-only evaluations.
3. The paper frequently claims "efficiency" and "data efficiency" without direct, quantifiable metrics and experimentation. It feels like the efficiency claim is on
Using MoE to construct a large speech-text model is an interesting approach.
1. The motivation for using a modality-aware router is unclear, as modality representations are generally easy to distinguish. The necessity of MoE in this context is not well justified.
2. The comparisons of data and models in the paper are unclear. The description of the initialized large model is insufficient, and evaluations on Llama Question S2T and Web Question are missing. Evaluations of text-based foundational models are also lacking.
3. Additionally, the model weights are not open-sou
- This paper introduces a novel Modality-Aware Mixture of Experts (MAMoE) architecture and a highly data-efficient Text-Speech Transformation Pipeline, which skillfully adapts a pretrained LLM into a powerful speech-text model through targeted post-training and instruction tuning.
- Most of the paper's illustrations are commendably clear.
- The commitment to fully open-sourcing the work provides a valuable asset to the research community.
- The modality-aware routing relies on deterministic, tag-based assignment to expert groups, which closely resembles a two-tower architecture and may underutilize MoE's core strength of dynamic, content-based routing.
- Based on the results in Table 1 and Table 2, the proposed MoST model consistently outperforms the baselines across several benchmarks, but the margin of improvement is limited on certain metrics.
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Multimodal Machine Learning Applications
