MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling

Abdoul Majid O. Thiombiano; Brahim Hnich; Ali Ben Mrad; Mohamed Wiem Mkaouer

arXiv:2505.01459·cs.CL·May 6, 2025


Open Access

TL;DR

MoxE combines xLSTM with a sparse mixture of experts and entropy-aware routing to reduce computational overhead in large language models while balancing expert utilization across rare and common tokens.

Contribution

It introduces an entropy-based routing mechanism within a mixture of xLSTM experts, aimed at improving efficiency and keeping expert utilization balanced in large language models.

Findings

01. Significant efficiency improvements over existing models

02. Effective handling of rare and common tokens

03. Robust training with auxiliary entropy and group-wise losses

Abstract

This paper introduces MoxE, a novel architecture that synergistically combines the Extended Long Short-Term Memory (xLSTM) with the Mixture of Experts (MoE) framework to address critical scalability and efficiency challenges in large language models (LLMs). The proposed method effectively leverages xLSTM's innovative memory structures while strategically introducing sparsity through MoE to substantially reduce computational overhead. At the heart of our approach is a novel entropy-based routing mechanism, designed to dynamically route tokens to specialized experts, thereby ensuring efficient and balanced resource utilization. This entropy awareness enables the architecture to effectively manage both rare and common tokens, with mLSTM blocks being favored to handle rare tokens. To further enhance generalization, we introduce a suite of auxiliary losses, including entropy-based and…
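The abstract does not spell out the routing formula, so the following is only a minimal sketch of what entropy-aware routing could look like, not the authors' method. It assumes a standard softmax router with top-k expert selection, reports the Shannon entropy of each token's routing distribution, and uses the entropy of the batch-averaged distribution as one possible balance signal. All function names (`softmax`, `entropy`, `route_top_k`, `load_balance_entropy`) and the choice of `k=2` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_top_k(logits, k=2):
    """Hypothetical router: pick the k most probable experts for one token.

    Returns the renormalized weights of the chosen experts and the entropy
    of the full routing distribution (high entropy = uncertain routing).
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    weights = {i: probs[i] / norm for i in top}
    return weights, entropy(probs)


def load_balance_entropy(batch_logits):
    """Entropy of the batch-averaged routing distribution.

    Assumption: a value near log(n_experts) means tokens are spread evenly
    over experts, so its negative could serve as an auxiliary penalty.
    """
    n_experts = len(batch_logits[0])
    mean = [0.0] * n_experts
    for logits in batch_logits:
        for i, p in enumerate(softmax(logits)):
            mean[i] += p / len(batch_logits)
    return entropy(mean)


if __name__ == "__main__":
    weights, h = route_top_k([2.0, 0.5, 0.1, -1.0], k=2)
    print("expert weights:", weights)
    print("routing entropy:", h)
    print("balance entropy:", load_balance_entropy([[10.0, 0.0], [0.0, 10.0]]))
```

In this sketch a confidently routed token has low entropy and an ambiguous one has high entropy. How MoxE actually uses that signal, for example to favor mLSTM blocks for rare tokens, is not recoverable from the abstract and is left out here.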


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Mixture of Experts · Multiplicative LSTM