MixtureKit: A General Framework for Composing, Training, and Visualizing Mixture-of-Experts Models
Ahmad Chamma, Omar El Herraoui, Guokan Shang

TL;DR
MixtureKit is an open-source framework that simplifies the construction, training, and visualization of various Mixture-of-Experts models, enabling fine-grained routing and analysis for improved multilingual performance.
Contribution
It introduces a flexible, modular framework supporting multiple MoE methods, with automatic model modification, training, and visualization tools for research and development.
Findings
BTX-based models outperform dense baselines on multilingual benchmarks
Framework supports arbitrary pre-trained models and fine-tuning
Provides visualization tools for model interpretability
Abstract
We introduce MixtureKit, a modular open-source framework for constructing, training, and analyzing Mixture-of-Experts (MoE) models from arbitrary pre-trained or fine-tuned models. MixtureKit currently supports three complementary methods: (i) \emph{Traditional MoE}, which uses a single router per transformer block to select experts, (ii) \emph{BTX} (Branch-Train-Mix), which introduces separate routers for each specified sub-layer enabling fine-grained token routing, and (iii) \emph{BTS} (Branch-Train-Stitch), which keeps experts fully intact and introduces trainable stitch layers for controlled information exchange between hub and experts. MixtureKit automatically modifies the model configuration, patches decoder and causal LM classes, and saves a unified checkpoint ready for inference or fine-tuning. We further provide a visualization interface to inspect per-token routing decisions,…
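The routing idea in the abstract, where a router scores the experts for each token and only some of them process it, can be illustrated with a minimal sketch. This is not MixtureKit's API: the names `route_token`, `router_weights`, and `experts` are hypothetical, and top-k selection with softmax gating is an assumed (though common) choice that the abstract does not specify. In the abstract's terms, a Traditional MoE would hold one such router per transformer block, while BTX would hold one per specified sub-layer.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(
    token: Vector,
    router_weights: List[Vector],             # hypothetical: one weight row per expert
    experts: List[Callable[[Vector], Vector]],
    top_k: int = 2,                           # assumed gating hyperparameter
) -> Tuple[Vector, List[int], List[float]]:
    """Send one token through its top_k experts and mix their outputs."""
    # Router logits: one dot product between the token and each expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Indices of the top_k highest-scoring experts, best first.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalise the gate weights over the selected experts only.
    gates = softmax([logits[i] for i in chosen])
    # Weighted sum of the selected experts' outputs.
    output = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen, gates


if __name__ == "__main__":
    # Toy experts that simply scale the token; real experts would be sub-layers.
    toy_experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0)]
    toy_router = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    mixed, picked, weights = route_token([2.0, 1.0], toy_router, toy_experts)
    print("experts chosen:", picked, "gate weights:", weights, "output:", mixed)
```

The returned `chosen` and `gates` values are the kind of per-token routing decisions a visualization interface could display, although how MixtureKit records them is not described in this excerpt.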
Peer Reviews
Decision·Submitted to ICLR 2026
The difficulty of training MoE models has been a longstanding problem, and there is indeed a need for a library to simplify the MoE research process. The software, as described by the authors, seems to have a fairly simple and intuitive interface, created to be compatible with the HuggingFace library of models, and includes useful routing visualizations to support research.
In general, it's difficult to assess the strength of a software contribution. As a reviewer, it's impossible to know whether the implementation is truly as usable or intuitive as the authors describe. In line 381, the authors say "[t]his tool has proven valuable for…" but we do not receive any details about the scenarios where the tool has proven valuable. While the paper includes one set of experiments from the authors, it lacks any other point of comparison to support its implementation quali
This appears to be a potentially very useful toolkit. Being able to use pre-trained experts and combine them in a much broader way than the current literature is very appealing.
I am not sure that the proposed toolkit actually accomplishes what it says. A bit more experimentation could be useful. For instance, in Table 1 the authors try to reproduce another paper using their method, but it is unclear to me whether or not they actually ever do. It would be nice to see this done, and perhaps one more paper as well, using a different method and part of their toolkit, to show it actually does everything they claim. This is not a paper that needs to beat SOTA results, but only
**Originality** The paper presents an open-source toolkit, **MixtureKit**, for composing and training Mixture-of-Experts (MoE) models from existing pretrained or fine-tuned checkpoints.
**Quality**
- The implementation appears **technically sound and modular**, providing a clear merging and patching pipeline compatible with the HuggingFace ecosystem.
- The inclusion of a **visualization interface** for analyzing token routing and expert utilization adds usability value.
However, there
1. **The motivation lacks depth and practical justification.** The paper positions MixtureKit as a “general framework for composing and training MoE models,” but fails to explain why such a framework is necessary beyond existing open-source systems (e.g., DeepSpeed-MoE, FairScale, Megatron-LM). There is no demonstrated bottleneck or pain point that MixtureKit specifically solves. As a result, the motivation appears weak and insufficiently grounded in practical or scientific needs.
2. **The
Taxonomy
Topics: Advanced Graph Neural Networks · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
