MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models
Hongyu Wang, Jiayu Xu, Ruiping Wang, Yan Feng, Yitao Zhai, Peng Pei, Xunliang Cai, Xilin Chen

TL;DR
MoTE introduces a memory-efficient mixture-of-ternary-experts approach for large multimodal models, enabling comparable performance to full-precision models with significantly reduced memory footprint, suitable for edge deployment.
Contribution
The paper proposes training low-precision ternary experts instead of high-precision ones in MoEs, improving memory efficiency while maintaining performance.
Findings
MoTE achieves similar accuracy to full-precision MoE-LLaVA.
MoTE's memory footprint is significantly lower, especially when combined with post-training quantization.
Performance gains are notable under strict memory constraints.
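The memory claim in these findings can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: the layer dimensions, the three-matrix gated FFN layout, and the 16-bit baseline are assumptions chosen for the example, not figures from the paper. It compares one full-precision expert against a ternary expert stored at the information-theoretic minimum of log2(3) ≈ 1.58 bits per weight, and at a simpler 2 bits per weight.

```python
import math

def expert_bytes(d_model, d_ff, bits_per_weight, n_matrices=3):
    """Approximate storage for one FFN expert, ignoring scales and biases.

    n_matrices=3 assumes a gated FFN (gate, up, down projections);
    this layout is an assumption, not taken from the paper.
    """
    n_params = n_matrices * d_model * d_ff
    return n_params * bits_per_weight / 8

# Hypothetical dimensions, chosen only for illustration.
D_MODEL, D_FF = 2048, 8192

fp16 = expert_bytes(D_MODEL, D_FF, 16)
ternary_ideal = expert_bytes(D_MODEL, D_FF, math.log2(3))  # about 1.58 bits per weight
ternary_2bit = expert_bytes(D_MODEL, D_FF, 2)              # simple 2-bit packing

for name, size in [("FP16", fp16),
                   ("ternary (ideal)", ternary_ideal),
                   ("ternary (2-bit)", ternary_2bit)]:
    print(f"{name:16s} {size / 2**20:6.1f} MiB per expert")

print(f"FP16 / ternary (2-bit) ratio: {fp16 / ternary_2bit:.1f}x")
```

Under these assumptions, each routed expert shrinks by roughly an order of magnitude, which is why more ternary experts can fit in the budget of fewer full-precision ones.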
Abstract
Large multimodal Mixture-of-Experts (MoEs) effectively scale the model size to boost performance while maintaining fixed active parameters. However, previous works primarily utilized full-precision experts during sparse up-cycling. Although they show superior performance on end tasks, the large number of experts introduces a higher memory footprint, which poses significant challenges for deployment on edge devices. In this work, we propose MoTE, a scalable and memory-efficient approach to train Mixture-of-Ternary-Experts models from a dense checkpoint. Instead of training fewer high-precision experts, we propose to train more low-precision experts during up-cycling. Specifically, we use the pre-trained FFN as a shared expert and train ternary routed experts with parameters in {-1, 0, 1}. Extensive experiments show that our approach has a promising scaling trend along model size. MoTE…
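The architecture described in the abstract, a full-precision shared expert plus routed experts with weights in {-1, 0, 1}, can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation. Several parts are assumptions: the absmean scaling rule in `ternarize`, the softmax top-k router, and the reduction of each expert to a single linear map (a real FFN expert has several projections and a nonlinearity). All function and variable names are hypothetical, and training (for example, with a straight-through estimator) is not shown.

```python
import math

def ternarize(weights, eps=1e-8):
    """Map a 2-D list of floats to values in {-1, 0, 1} plus one scale.

    Assumption: absmean scaling (scale = mean |w|); the source does not
    specify the quantizer.
    """
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps
    q = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return q, scale

def matvec(matrix, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mote_layer(x, shared_w, ternary_experts, router_w, top_k=1):
    """Shared full-precision expert plus top-k routed ternary experts."""
    out = matvec(shared_w, x)             # stand-in for the pre-trained FFN
    probs = softmax(matvec(router_w, x))  # one router logit per routed expert
    chosen = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    for i in chosen:
        q, scale = ternary_experts[i]
        y = matvec(q, x)                  # weights are only -1, 0, or 1
        out = [o + probs[i] * scale * yi for o, yi in zip(out, y)]
    return out

# Toy data, hypothetical values for illustration only.
shared_w = [[1.0, 0.0], [0.0, 1.0]]
raw_experts = [
    [[0.9, -0.1], [0.05, -1.2]],
    [[-0.7, 0.6], [0.02, 0.8]],
]
ternary_experts = [ternarize(w) for w in raw_experts]
router_w = [[1.0, 0.0], [0.0, 1.0]]
x = [2.0, 1.0]

print(ternary_experts[0][0])
print(mote_layer(x, shared_w, ternary_experts, router_w))
```

Because each ternary weight is -1, 0, or 1, the expert's matrix product reduces to additions and subtractions followed by one multiplication by the scale, which is part of the appeal for edge hardware.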
Peer Reviews
Decision · Submitted to ICLR 2026
- This paper presents a novel and practical architecture for memory-efficient MoE up-cycling. The core insight to retain the pre-trained FFN as a frozen, high-precision shared expert while only training new, low-precision experts is a well-motivated approach to balancing knowledge retention and memory efficiency.
- The proposed MoTE framework achieves a compelling trade-off between performance and memory. It demonstrates performance comparable to a full-precision MoE baseline (MoE-LLaVA) at scal
- The paper does not state whether the experimental results (e.g., in Table 2 and Table 3) are from a single training run or averaged over multiple runs with different random seeds. This makes it difficult to assess the statistical reliability and robustness of the reported performance gains.
- The paper compares MoTE primarily against the full-precision MoE-LLaVA baseline and that baseline with PTQ. It does not include comparisons to other potential memory-saving techniques, such as applying pa
1. The work presents a noteworthy improvement in memory utilization, achieving substantial efficiency gains relative to conventional approaches.
2. The study effectively combines MoE architecture with ternary quantization, showcasing a creative and technically sophisticated integration of two complex methodologies.
3. The paper offers valuable empirical guidance on maintaining training stability when applying aggressive quantization to large, sparse models, providing useful reference points for f
1. The paper’s exclusive focus on ternary (three-valued) quantization lacks adequate theoretical grounding and empirical validation. The rationale for this particular quantization level is not convincingly articulated, leaving the impression that the choice may stem from empirical convenience rather than a principled design objective. A clearer theoretical motivation is necessary to establish why the ternary configuration represents an optimal or essential component of the proposed architecture rather
1. The paper addresses a practically important issue, reducing memory and compute costs for large multimodal models for edge applications.
2. The idea of introducing a frozen shared expert to stabilize training and compensate for low-precision experts is conceptually simple and intuitive, and the method is easy to implement.
1. Questionable scalability and necessity: Table 1 shows that MoTE performs poorly on the 0.5B model but achieves better results on 1.5B and 3B models. The authors attribute this to the known trend that larger models are easier to quantize. While this explanation is plausible, it also weakens the claimed necessity of MoTE: if quantization robustness naturally improves with scale, it remains unclear whether MoTE itself contributes meaningfully to the observed gains, or if the improvement is large
Taxonomy
Topics: Data Management and Algorithms · Human Mobility and Location-Based Analysis · Context-Aware Activity Recognition Systems
