Local Mixtures of Experts: Essentially Free Test-Time Training via Model Merging

Ryo Bertolissi; Jonas Hübotter; Ido Hakimi; Andreas Krause

arXiv:2505.14136·cs.LG·July 31, 2025

TL;DR

This paper introduces Test-Time Model Merging (TTMM), which scales Mixture-of-Experts (MoE) models to an order of magnitude more experts and uses model merging to keep test-time overhead minimal, efficiently approximating test-time training (TTT).

Contribution

The paper proposes TTMM, which uses model merging to scale MoE models to an order of magnitude more experts while keeping test-time cost low, and shows that TTMM approximates test-time training (TTT).

Findings

01

TTMM performance improves with more experts.

02

TTMM approaches TTT performance as experts increase.

03

TTMM is more than 100x faster than TTT at test time with a 1B-parameter base model.

Abstract

Mixture-of-experts (MoE) models are a promising approach to increasing model capacity without increasing inference cost, and are core components of many state-of-the-art language models. However, current MoE models typically use only a few experts due to prohibitive training and inference cost. We propose Test-Time Model Merging (TTMM), which scales the MoE paradigm to an order of magnitude more experts and uses model merging to avoid almost any test-time overhead. We show that TTMM is an approximation of test-time training (TTT), which fine-tunes an expert model for each prediction task, i.e., prompt. TTT has recently been shown to significantly improve language models, but is computationally expensive. We find that the performance of TTMM improves with more experts and approaches the performance of TTT. Moreover, we find that with a 1B parameter base model, TTMM is more than 100x faster than…
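To make the idea of merging experts at test time concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes, purely for illustration, that each expert is stored as a parameter delta relative to the base model, that each expert has a representative embedding (called a centroid here), and that merge coefficients come from a sparse softmax over cosine similarities to the prompt embedding. The names `merge_weights`, `merge_experts`, `temperature`, and `top_k` are hypothetical; see the repository listed under Code & Models for the actual method.

```python
import numpy as np


def merge_weights(prompt_emb, centroids, temperature=0.1, top_k=2):
    """Sparse merge coefficients over experts (hypothetical routing rule).

    Keeps the top_k experts whose centroids are most cosine-similar to the
    prompt embedding and applies a softmax over those similarities.
    """
    p = prompt_emb / np.linalg.norm(prompt_emb)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = c @ p                           # cosine similarity per expert
    top = np.argsort(sims)[-top_k:]        # indices of the top_k experts
    logits = sims[top] / temperature
    w = np.exp(logits - logits.max())      # numerically stable softmax
    w /= w.sum()
    weights = np.zeros(len(centroids))
    weights[top] = w                       # all other experts get weight 0
    return weights


def merge_experts(base, deltas, weights):
    """Merged parameters: base + sum_i weights[i] * deltas[i].

    Assumes deltas has shape (num_experts, *base.shape).
    """
    return base + np.tensordot(weights, deltas, axes=1)
```

Under these assumptions the test-time cost per prompt is one similarity lookup and one weighted sum of parameters, followed by a single forward pass through the merged model. TTT, as described in the abstract, instead fine-tunes an expert for each prompt.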

Code & Models

Repositories

rbertolissi/ttmerge
pytorch

Taxonomy

Methods: Mixture of Experts · Balanced Selection