TL;DR
This paper introduces Test-Time Model Merging (TTMM), which scales Mixture of Experts (MoE) models to an order of magnitude more experts and uses model merging to keep test-time overhead minimal, efficiently approximating test-time training (TTT).
Contribution
The paper proposes TTMM, which uses model merging to scale MoE models to an order of magnitude more experts at low test-time cost. It also shows that TTMM approximates test-time training (TTT), which fine-tunes an expert model for each prompt.
Findings
TTMM performance improves as the number of experts grows.
TTMM approaches the performance of TTT as the number of experts grows.
With a 1B parameter base model, TTMM is more than 100x faster than TTT at test time.
Abstract
Mixture of experts (MoE) models are a promising approach to increasing model capacity without increasing inference cost, and are core components of many state-of-the-art language models. However, current MoE models typically use only a few experts due to prohibitive training and inference cost. We propose Test-Time Model Merging (TTMM), which scales the MoE paradigm to an order of magnitude more experts and uses model merging to avoid almost any test-time overhead. We show that TTMM is an approximation of test-time training (TTT), which fine-tunes an expert model for each prediction task, i.e., prompt. TTT has recently been shown to significantly improve language models, but is computationally expensive. We find that performance of TTMM improves with more experts and approaches the performance of TTT. Moreover, we find that with a 1B parameter base model, TTMM is more than 100x faster than…
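The idea of merging several experts into one model for a given prompt can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's method: the abstract excerpt does not specify how experts are selected or weighted, so the cosine-similarity routing, the softmax weighting, and the names `merge_experts`, `expert_keys`, `expert_deltas`, `top_k`, and `temperature` are all hypothetical. Parameters are flat lists of floats for simplicity, and all embedding vectors are assumed to be nonzero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def merge_experts(base, expert_deltas, expert_keys, prompt_embedding,
                  top_k=2, temperature=1.0):
    """Merge the top_k experts most similar to the prompt into one parameter list.

    ASSUMPTION: each expert is stored as a delta on top of shared base
    parameters and is described by a key embedding. Neither detail comes
    from the paper summary; both are chosen only for illustration.
    """
    # Score every expert against the prompt.
    scores = [cosine(key, prompt_embedding) for key in expert_keys]

    # Keep the indices of the top_k highest-scoring experts.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

    # Softmax over the selected scores gives the merging weights.
    logits = [scores[i] / temperature for i in top]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the selected deltas, added to the base parameters.
    merged = list(base)
    for w, i in zip(weights, top):
        for j, d in enumerate(expert_deltas[i]):
            merged[j] += w * d
    return merged, top, weights


# Toy usage: three experts, two-dimensional prompt embeddings.
base = [0.0, 0.0, 0.0]
expert_deltas = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
expert_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
merged, chosen, weights = merge_experts(
    base, expert_deltas, expert_keys, [1.0, 0.0], top_k=2
)
```

In this sketch the merged result is a single set of parameters, so the forward pass after merging costs the same as running one model. That is consistent with the abstract's claim of almost no test-time overhead, although the actual mechanism in the paper may differ.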
Taxonomy
Methods: Mixture of Experts · Balanced Selection
