MixtureKit: A General Framework for Composing, Training, and Visualizing Mixture-of-Experts Models

Ahmad Chamma; Omar El Herraoui; Guokan Shang

arXiv:2512.12121·cs.LG·December 16, 2025


Open Access·3 Reviews

TL;DR

MixtureKit is an open-source framework that simplifies the construction, training, and visualization of various Mixture-of-Experts models, enabling fine-grained routing and analysis for improved multilingual performance.

Contribution

It introduces a flexible, modular framework supporting multiple MoE methods, with automatic model modification, training, and visualization tools for research and development.

Findings

01. BTX-based models outperform dense baselines on multilingual benchmarks

02. Framework supports arbitrary pre-trained models and fine-tuning

03. Provides visualization tools for model interpretability

Abstract

We introduce MixtureKit, a modular open-source framework for constructing, training, and analyzing Mixture-of-Experts (MoE) models from arbitrary pre-trained or fine-tuned models. MixtureKit currently supports three complementary methods: (i) Traditional MoE, which uses a single router per transformer block to select experts, (ii) BTX (Branch-Train-Mix), which introduces separate routers for each specified sub-layer, enabling fine-grained token routing, and (iii) BTS (Branch-Train-Stitch), which keeps experts fully intact and introduces trainable stitch layers for controlled information exchange between hub and experts. MixtureKit automatically modifies the model configuration, patches decoder and causal LM classes, and saves a unified checkpoint ready for inference or fine-tuning. We further provide a visualization interface to inspect per-token routing decisions,…
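To make the routing distinction in the abstract concrete, the following is a minimal pure-Python sketch of top-k expert routing. It is not MixtureKit's API or implementation: the function names (`route_top_k`, `moe_forward`, `make_router`), the toy experts, the choice of k, and the sub-layer names `attention` and `mlp` are all assumptions made for illustration. It only contrasts one router per block (Traditional MoE, as described above) with one router per sub-layer (BTX-style, as described above).

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(token, router_weights, k=2):
    """Return (expert_index, gate) pairs for the k highest-scoring experts.

    router_weights holds one weight vector per expert; an expert's score
    is the dot product of its vector with the token vector.
    """
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    gates = softmax([scores[i] for i in top])  # renormalise over the selected experts
    return list(zip(top, gates))


def moe_forward(token, router_weights, experts, k=2):
    """Gate-weighted sum of the selected experts' outputs."""
    out = [0.0] * len(token)
    for idx, gate in route_top_k(token, router_weights, k):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out


def make_router(num_experts, dim, rng):
    """Hypothetical helper: random router weights, one row per expert."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]


rng = random.Random(0)
dim, num_experts = 4, 3
# Toy experts (assumption): each scales the token by a different constant.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0)]

# Traditional MoE, per the abstract: a single router for the whole block.
block_router = make_router(num_experts, dim, rng)
# BTX-style, per the abstract: a separate router for each routed sub-layer.
# The sub-layer names here are illustrative assumptions.
sublayer_routers = {
    name: make_router(num_experts, dim, rng) for name in ("attention", "mlp")
}

token = [0.5, -1.0, 0.25, 2.0]
print("block:", route_top_k(token, block_router))
for name, router in sublayer_routers.items():
    print(name + ":", route_top_k(token, router))
```

Because each sub-layer owns its router in the second setup, the same token can be sent to different experts in different sub-layers, which is the fine-grained behaviour the abstract attributes to BTX.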

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

The difficulty of training MoE models has been a longstanding problem, and there is indeed a need for a library to simplify the MoE research process. The software, as described by the authors, seems to have a fairly simple and intuitive interface, created to be compatible with the HuggingFace library of models, and includes useful routing visualizations to support research.

Weaknesses

In general, it's difficult to assess the strength of a software contribution. As a reviewer, it's impossible to know whether the implementation is truly as usable or intuitive as the authors describe. In line 381, the authors say "[t]his tool has proven valuable for…" but we do not receive any details about the scenarios where the tool has proven valuable. While the paper includes one set of experiments from the authors, it lacks any other point of comparison to support its implementation quali

Reviewer 02·Rating 4·Confidence 3

Strengths

This appears to be a potentially very useful toolkit. Being able to use pre-trained experts and combine them in a much broader way than the current literature is very appealing.

Weaknesses

I am not sure that the proposed toolkit actually accomplishes what it says. A bit more experimentation could be useful. For instance, in Table 1 the authors try to reproduce another paper using their method, but it is unclear to me whether or not they actually ever do. It would be nice to see this done, and perhaps one more paper as well, using a different method and part of their toolkit, to show it actually does everything they claim. This is not a paper that needs to beat SOTA results, but only

Reviewer 03·Rating 2·Confidence 5

Strengths

**Originality** The paper presents an open-source toolkit, **MixtureKit**, for composing and training Mixture-of-Experts (MoE) models from existing pretrained or fine-tuned checkpoints.

**Quality**
- The implementation appears **technically sound and modular**, providing a clear merging and patching pipeline compatible with the HuggingFace ecosystem.
- The inclusion of a **visualization interface** for analyzing token routing and expert utilization adds usability value. However, there

Weaknesses

1. **The motivation lacks depth and practical justification.** The paper positions MixtureKit as a “general framework for composing and training MoE models,” but fails to explain why such a framework is necessary beyond existing open-source systems (e.g., DeepSpeed-MoE, FairScale, Megatron-LM). There is no demonstrated bottleneck or pain point that MixtureKit specifically solves. As a result, the motivation appears weak and insufficiently grounded in practical or scientific needs.
2. **The


Taxonomy

Topics: Advanced Graph Neural Networks · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications