Temporally Extended Mixture-of-Experts Models

Zeyu Shen; Peter Henderson

arXiv:2604.20156 · cs.LG · April 23, 2026

TL;DR

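This paper introduces temporally extended mixture-of-experts (MoE) layers built on the options framework from reinforcement learning. Applied to gpt-oss-20b, the method cuts expert switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy, which is intended to make memory-efficient serving techniques such as offloading and pre-fetching effective again.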

Contribution

It proposes an options-based approach for MoE models in which a controller added to each layer learns when to switch expert sets and which set to load. Trained with low-rank adapters and a self-distillation reward, it reduces switch rates from over 50% to below 5%.

Findings

01 Switch rates drop from over 50% to below 5% on gpt-oss-20b.

02 The converted model retains up to 90% of base-model accuracy on MATH, MMLU, and MMMLU.

03 Lightweight training with low-rank adapters and a self-distillation reward converts an existing pre-trained model into a temporally extended MoE.

Abstract

Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a controller to each layer that learns when to switch expert sets and which to load. By applying this to gpt-oss-20b with low-rank adapters and a self-distillation reward, our method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with…
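To make the notion of a switch rate concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the switch-rate definition (fraction of consecutive tokens whose active expert set differs), the i.i.d. Gaussian router scores, and the hand-written `margin` termination rule are all assumptions made for illustration. In the paper the decision of when to switch is learned by a per-layer controller, not fixed by a threshold.

```python
import random


def top_k_set(scores, k):
    """Return the indices of the k highest scores as a frozenset."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return frozenset(order[:k])


def switch_rate(expert_sets):
    """Fraction of consecutive token pairs whose active expert set differs.

    This definition is an assumption; the paper may measure switching differently.
    """
    if len(expert_sets) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(expert_sets, expert_sets[1:]))
    return changes / (len(expert_sets) - 1)


def per_token_routing(score_seq, k):
    """Standard MoE-style routing: pick a fresh top-k set at every token."""
    return [top_k_set(scores, k) for scores in score_seq]


def sticky_routing(score_seq, k, margin):
    """Keep the current expert set until a termination rule fires.

    Hypothetical rule: switch only when the best available set beats the
    current set's total score by more than `margin`. A learned termination
    function would replace this threshold.
    """
    current = None
    chosen = []
    for scores in score_seq:
        best = top_k_set(scores, k)
        if current is None:
            current = best
        else:
            gain = sum(scores[i] for i in best) - sum(scores[i] for i in current)
            if gain > margin:
                current = best
        chosen.append(current)
    return chosen


# Toy data: 500 tokens, 8 experts, i.i.d. Gaussian router scores (not realistic).
rng = random.Random(0)
scores = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]

base = per_token_routing(scores, k=2)
sticky = sticky_routing(scores, k=2, margin=3.0)

print("per-token switch rate:", switch_rate(base))
print("sticky switch rate:   ", switch_rate(sticky))
```

Raising `margin` plays a role loosely analogous to a deliberation cost: switching becomes more expensive, so the active set persists longer, at the price of sometimes running with a lower-scoring set. The sketch does not model the resulting accuracy trade-off.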

Code & Models

Repositories

princeton-polaris-lab/rl_moe (GitHub)
