MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
Arian Raje, Anupam Nayak, Gauri Joshi

TL;DR
MELINOE is a fine-tuning approach that makes Mixture-of-Experts inference more memory-efficient by reducing expert transfer overhead, yielding substantial throughput gains without sacrificing downstream task performance.
Contribution
MELINOE introduces a fine-tuning method that biases an MoE model toward activating fewer experts per sequence, so that those experts can be cached in GPU memory and CPU-GPU transfer latency drops during inference.
Findings
Increases throughput by up to 14.7x over transfer-heavy baselines.
Maintains or improves downstream task performance.
Reduces expert transfer latency in MoE inference.
Abstract
Mixture-of-Experts (MoE) model architectures can significantly reduce the number of activated parameters per token, enabling computationally efficient training and inference. However, their large overall parameter counts and model sizes have precluded their widespread usage in resource-constrained settings as all of the parameters must still be loaded into GPU memory. Prior works aim to address this memory bottleneck by offloading certain experts into CPU memory and porting them to GPU memory only when they are activated. In practice, these methods suffer from the significant I/O latency incurred by expert transfer. We present MELINOE, a method that fine-tunes an MoE model to more strongly prefer activating a smaller number of experts per sequence. Caching these preferred experts in GPU memory reduces expert churn and CPU-GPU transfer overhead. MELINOE increases throughput by…
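The caching argument in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' implementation: it assumes an LRU eviction policy, uniformly random routing, and made-up sizes (64 experts, top-2 routing, 8 GPU slots), and every name in it (`ExpertCache`, `count_transfers`, `sample_routing`) is hypothetical. It only counts how often an expert would have to be copied from CPU to GPU memory when routing is spread across all experts versus concentrated on a small preferred set.

```python
import random
from collections import OrderedDict


class ExpertCache:
    """Simulated GPU-resident expert set with LRU eviction (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # expert_id -> True, oldest first
        self.transfers = 0             # simulated CPU-to-GPU copies

    def fetch(self, expert_id):
        if expert_id in self.resident:
            # Cache hit: mark the expert as most recently used.
            self.resident.move_to_end(expert_id)
            return
        # Cache miss: the expert would be copied from CPU memory.
        self.transfers += 1
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least recently used
        self.resident[expert_id] = True


def count_transfers(routing, capacity):
    """Count simulated transfers for a list of per-token expert ID lists."""
    cache = ExpertCache(capacity)
    for token_experts in routing:
        for expert_id in token_experts:
            cache.fetch(expert_id)
    return cache.transfers


def sample_routing(num_tokens, pool, top_k, seed):
    """Pick top_k distinct experts per token uniformly from pool (toy router)."""
    rng = random.Random(seed)
    return [rng.sample(pool, top_k) for _ in range(num_tokens)]


# Illustrative sizes only; none of these values come from the paper.
NUM_EXPERTS, TOP_K, CAPACITY, NUM_TOKENS = 64, 2, 8, 512

spread = sample_routing(NUM_TOKENS, list(range(NUM_EXPERTS)), TOP_K, seed=0)
concentrated = sample_routing(NUM_TOKENS, list(range(CAPACITY)), TOP_K, seed=0)

spread_transfers = count_transfers(spread, CAPACITY)
concentrated_transfers = count_transfers(concentrated, CAPACITY)

print("transfers with spread routing:      ", spread_transfers)
print("transfers with concentrated routing:", concentrated_transfers)
```

When the preferred set fits in the cache, each expert is transferred at most once and then stays resident, whereas spread routing keeps evicting experts that are needed again shortly afterwards. How MELINOE actually induces this preference during fine-tuning, and which caching policy it uses, is not specified in the text above.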
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Advanced Neural Network Applications
