TL;DR
This paper introduces three lightweight gate modifications, inspired by the Free Energy Principle, that help mixture-of-experts (MoE) models route correctly at domain transitions. In a controlled experiment they raise the probability assigned to the correct expert from 0.006 to 0.748.
Contribution
It proposes three gate modifications drawn from the Free Energy Principle (temporal memory, precision-weighted gating, and anticipatory routing) and shows that they improve expert assignment at domain transitions in MoE models.
Findings
Lightweight gate modifications raise the probability assigned to the correct expert at a domain transition from 0.006 to 0.748.
Beta and anticipatory routing together close 75% of the oracle gap.
Beta routing reduces language model transition perplexity from 6.56 to 4.01.
Abstract
Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 +/- 0.002 (124x), cutting the number of experts needed for 99% coverage from infeasible to a small constant. The modifications are: temporal memory (beta), a per-expert LIF (leaky integrate-and-fire) membrane potential that accumulates routing context across tokens; precision-weighted gating (Pi), a per-expert inverse variance of recent prediction error, which yields a 31x contrast between reliable and unreliable experts; and anticipatory routing, a next-state predictor conditioned on the beta-accumulated hidden state. The mechanisms draw from Friston's Free Energy Principle and use LIF dynamics from spiking neural networks.…
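To make two of the mechanisms named in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the leaky update rule, the way precision is combined with the accumulated state (adding log-precision to the logits), the function names, and all numeric values are assumptions chosen for illustration. Anticipatory routing is omitted because the abstract does not describe the predictor in enough detail to sketch it.

```python
import math
from statistics import pvariance


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def leaky_update(membrane, affinity, beta=0.9):
    """One leaky-integrator step per expert.

    Assumed rule: m_t = beta * m_{t-1} + (1 - beta) * a_t.
    The paper's exact LIF update may differ.
    """
    return [beta * m + (1.0 - beta) * a for m, a in zip(membrane, affinity)]


def precision(errors, eps=1e-6):
    """Inverse variance of one expert's recent prediction errors."""
    return 1.0 / (pvariance(errors) + eps)


def gate(membrane, precisions):
    """Routing probabilities from accumulated state and precision.

    Assumed combination: add log-precision to the accumulated logit,
    which is equivalent to scaling each expert's unnormalized weight
    by its precision.
    """
    logits = [m + math.log(p) for m, p in zip(membrane, precisions)]
    return softmax(logits)


# Hypothetical token stream for 4 experts: five tokens whose affinity
# favors expert 0, then one token whose affinity favors expert 1.
stream = [[2.0, 0.0, 0.0, 0.0]] * 5 + [[0.0, 2.0, 0.0, 0.0]]

membrane = [0.0] * 4
for affinity in stream:
    membrane = leaky_update(membrane, affinity, beta=0.9)

# Hypothetical recent prediction errors per expert (made-up values).
recent_errors = [
    [0.10, 0.12, 0.09],  # expert 0: low variance, so high precision
    [0.50, 1.50, 0.20],
    [0.30, 0.90, 0.10],
    [0.40, 1.10, 0.20],
]
precisions = [precision(errs) for errs in recent_errors]

probs = gate(membrane, precisions)
print(probs)
```

With beta close to 1 the accumulated state changes slowly, so a single token with a different affinity pattern shifts the routing distribution only slightly. Setting beta to 0 recovers memoryless affinity routing in this sketch.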
