ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning

Ruizhong Qiu; Hanqing Zeng; Yinglong Xia; Yiwen Meng; Ren Chen; Jiarui Feng; Dongqi Fu; Qifan Wang; Jiayi Liu; Jun Xiao; Xiangjun Fan; Benyu Zhang; Hong Li; Zhining Liu; Hyunsik Yoo; Zhichen Zeng; Tianxin Wei; Hanghang Tong

arXiv:2603.10160·cs.LG·March 12, 2026

Open Access · 3 Reviews

TL;DR

ReMix introduces a reinforcement learning-based routing mechanism for Mixture-of-LoRAs in LLM finetuning, addressing imbalance issues and significantly improving performance over existing methods.

Contribution

The paper proposes a novel non-learnable routing design with an unbiased reinforcement learning gradient estimator for Mixture-of-LoRAs, enhancing their expressive power.
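When the selected LoRAs are combined with fixed rather than learned weights, the router's choice is a discrete sample, so its parameters cannot be trained by backpropagating through the weights. One standard workaround is a score-function (REINFORCE) estimator. The sketch below is a toy illustration under stated assumptions, not the paper's estimator: it samples a single LoRA index per draw from a categorical router, treats a hypothetical per-LoRA `rewards` vector as the negative loss, and uses a leave-one-out baseline (our choice for variance reduction), which keeps the estimate unbiased.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rloo_gradient(theta, rewards, m, rng):
    """Score-function estimate of d E[reward] / d theta from m samples.

    theta:   router logits over n LoRAs (hypothetical toy router)
    rewards: reward obtained when each LoRA is chosen (toy assumption)
    """
    p = softmax(theta)
    idx = rng.choice(len(theta), size=m, p=p)   # sampled LoRA indices
    r = rewards[idx]
    # Leave-one-out baseline: mean reward of the other m - 1 samples.
    baseline = (r.sum() - r) / (m - 1)
    # Gradient of log p(idx) with respect to the logits: onehot - p.
    score = np.eye(len(theta))[idx] - p
    return ((r - baseline)[:, None] * score).mean(axis=0)

def exact_gradient(theta, rewards):
    """Closed-form gradient of sum_i p_i * rewards_i, for comparison."""
    p = softmax(theta)
    return p * (rewards - p @ rewards)
```

Because each baseline is computed without the sample it is applied to, it is independent of that sample's score term, so averaging many estimates should approach `exact_gradient`.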

Findings

1. ReMix outperforms state-of-the-art methods in parameter-efficient finetuning.
2. The non-learnable routing ensures balanced utilization of LoRAs.
3. The reinforcement learning approach scales well with increased training compute.

Abstract

Low-rank adapters (LoRAs) are a parameter-efficient finetuning technique that injects trainable low-rank matrices into pretrained models to adapt them to new tasks. Mixture-of-LoRAs models expand neural networks efficiently by routing each layer input to a small subset of specialized LoRAs of the layer. Existing Mixture-of-LoRAs routers assign a learned routing weight to each LoRA to enable end-to-end training of the router. Despite their empirical promise, we observe that the routing weights are typically extremely imbalanced across LoRAs in practice, where only one or two LoRAs often dominate the routing weights. This essentially limits the number of effective LoRAs and thus severely hinders the expressive power of existing Mixture-of-LoRAs models. In this work, we attribute this weakness to the nature of learnable routing weights and rethink the fundamental design of the router. To…
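As a rough illustration of the setup the abstract describes, the sketch below routes one layer input to the top-$k$ LoRAs using softmax routing weights and reports how many of them effectively contribute. All names (`mixture_of_loras`, `router_W`, `effective_count`) and the inverse-sum-of-squares measure are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mixture_of_loras(x, A, B, router_W, k):
    """Weighted sum of the top-k LoRA updates for one input.

    x: (d,) layer input
    A: (n, r, d) down-projections, B: (n, d_out, r) up-projections
    router_W: (n, d) linear router producing one logit per LoRA
    """
    logits = router_W @ x
    top = np.argsort(logits)[-k:]        # indices of the k largest logits
    weights = softmax(logits[top])       # learned routing weights
    delta = sum(w * (B[i] @ (A[i] @ x)) for w, i in zip(weights, top))
    return delta, top, weights

def effective_count(weights):
    """Inverse sum of squares: equals k when weights are uniform, 1 when one dominates."""
    return 1.0 / np.sum(weights ** 2)
```

For example, weights of `[0.5, 0.5]` give an effective count of 2, while `[0.98, 0.02]` give roughly 1.04, which is the kind of imbalance the abstract describes.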

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

* Provides a theoretical proof showing that routing weights are almost surely imbalanced under standard conditions.
* Introduces a framework guaranteeing at least $k$ active adapters at any given time.
* Discusses implementation and procedural differences between fine-tuning and inference phases.

Weaknesses

* Unclear motivation: Having imbalanced routing is not necessarily undesirable. In practice, sparsity can be beneficial since less active adapters could be offloaded to slower memory tiers, improving efficiency. The paper lacks a deep discussion of why routing imbalance is inherently problematic.
* Although the proposed method enforces at least $k$ active adapters, it does not guarantee diversity across selections. The same subset of $k$ adapters might always be activated together, which effec

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper is well written and easy to follow.
2. The theoretical and empirical analyses are insightful.

Weaknesses

1. For each theorem statement, providing a high-level explanation of the proof structure and key intuitions would significantly improve readability.
2. The routing imbalance issue is well known in the mixture-of-experts literature. Related work should also discuss mixture-of-experts, compare mixture-of-experts with Mixture-of-LoRAs, and discuss how the imbalance issue has been addressed for mixture-of-experts.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper is well organized and clearly structured, making it easy to follow.
2. It offers both theoretical and empirical analyses of the imbalance problem.
3. The figures and illustrations are clear and easy to interpret.

Weaknesses

1. Eq. (3) concerns only the utilization of LoRAs for each given input, so it is a sample-specific measurement. Could your observations be biased by the choice of samples?
2. In Figure 1, you track only the routing weights of the last layer. Would the pattern be the same across layers? It would be better to illustrate this with a heat map that includes every layer.
3. In Section 3.1, you introduce the hyperparameters $k$, the number of activated LoRAs, and $\omega$, the routing weights.

Taxonomy

Topics: Software-Defined Networks and 5G · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications