Glider: Global and Local Instruction-Driven Expert Router

Pingzhi Li; Prateek Yadav; Jaehong Yoon; Jie Peng; Yi-Lin Sung; Mohit Bansal; Tianlong Chen

arXiv:2410.07172 · cs.LG · June 16, 2025

TL;DR

GLIDER introduces a multi-scale routing mechanism combining global semantic instructions and local token-level decisions, significantly improving expert selection for both seen and unseen tasks in model merging.

Contribution

It proposes a novel multi-scale routing framework leveraging LLM reasoning to enhance expert selection, addressing limitations of existing MoErging methods.

Findings

01

Improved held-in task performance with GLIDER.

02

Maintains strong generalization on unseen tasks.

03

Ablation studies confirm the effectiveness of multi-scale routing.

Abstract

The availability of performant pre-trained models has led to a proliferation of fine-tuned expert models that are specialized to particular domains. This has enabled the creation of powerful and adaptive routing-based "Model MoErging" methods with the goal of using expert modules to create an aggregate system with improved performance or generalization. However, existing MoErging methods often prioritize generalization to unseen tasks at the expense of performance on held-in tasks, which limits their practical applicability in real-world deployment scenarios. We observe that current token-level routing mechanisms neglect the global semantic context of the input task. This token-wise independence hinders effective expert selection for held-in tasks, as routing decisions fail to incorporate the semantic properties of the task. To address this, we propose Global and Local Instruction Driven…
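The abstract describes combining a global, task-level signal with local, token-level routing but, as excerpted here, does not give the scoring rule. The following is only an illustrative sketch of that general idea, not the authors' implementation. The function names (`cosine`, `route_token`), the use of cosine similarity, the linear mixing weight `alpha`, and the top-k selection are all assumptions introduced for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def route_token(token_vec, instruction_vec, local_gates, global_keys,
                alpha=0.5, top_k=1):
    """Pick the top_k expert indices for a single token.

    Hypothetical scoring, assumed for illustration only:
      local score  = similarity of the token to each expert's gate vector
      global score = similarity of the task-instruction embedding to each
                     expert's task embedding (identical for every token)
      final score  = alpha * global + (1 - alpha) * local
    """
    scores = []
    for gate, key in zip(local_gates, global_keys):
        local = cosine(token_vec, gate)
        glob = cosine(instruction_vec, key)
        scores.append(alpha * glob + (1 - alpha) * local)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Toy setup with two experts in a 2-dimensional space.
local_gates = [[1.0, 0.0], [0.0, 1.0]]   # per-expert token-level gate vectors
global_keys = [[1.0, 0.0], [0.0, 1.0]]   # per-expert task embeddings
token_vec = [0.6, 0.8]                   # locally, this token looks like expert 1
instruction_vec = [1.0, 0.0]             # globally, the task matches expert 0

local_only = route_token(token_vec, instruction_vec, local_gates, global_keys, alpha=0.0)
mixed = route_token(token_vec, instruction_vec, local_gates, global_keys, alpha=0.5)
```

In this toy case, purely local routing (`alpha=0.0`) sends the token to expert 1, while mixing in the global signal (`alpha=0.5`) sends it to expert 0, the expert whose task embedding matches the instruction. This mirrors the abstract's point that token-wise independent routing can ignore the semantic properties of the task.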

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. The insights of the paper are to be praised.
2. Very interesting topic and focus on the router optimization.
3. Use the big model and small model together to solve the problem.
4. Experimental results are good.

Weaknesses

1. Writing and Presentation: The paper could benefit from some polishing. There are a number of typos and semantic issues, and the overall formatting could be improved for better readability. Additionally, some figures are a bit challenging to interpret. For instance, Figure 1 is only referenced in Appendix B but appears as the first figure in the Related Works section, which can disrupt the flow and clarity for the reader.
2. Clarity of Background and Concepts: The background and explanation of

Reviewer 02 · Rating 5 · Confidence 3

Strengths

* The paper leverages LLMs to generate semantic task descriptions, providing global context for routing decisions, which is a unique approach not explored in previous routing methods.
* The paper addresses the limitations of current approaches (focusing on either held-in or held-out tasks) well and provides a novel solution integrating both.

Weaknesses

* The experiments focus solely on T5, an older encoder-decoder architecture. The effectiveness of GLIDER on modern decoder-only models (like GPT family, LLaMA, etc.) remains unproven, which is crucial given these are now the mainstream architectures for LLMs.
* Table 1 lacks clarity on evaluation metrics and methodological details. Without clear metric definitions and evaluation protocols, it's difficult to fully assess and compare the reported improvements.
* The routing design will bring extra

Reviewer 03 · Rating 3 · Confidence 4

Strengths

This work introduces a routing mechanism to trade off between local and global experts, to increase performance on held-in tasks without compromising the capability to handle held-out tasks. The goal is clear and the approach is simple (as it is heuristic in nature).

Weaknesses

Evaluation: As this work has no theoretical basis, one would have expected a significantly larger experimental part to convince the reader of the generality of the approach, covering a significantly wider range of tasks (and possibly comparison points beyond those adopted in Phatgoose) and further exhibiting a statistically relevant comparison of improvements. This is not the case, so the paper's execution is far from convincing. Additionally, while the main advantage of this work is to increase

Reviewer 04 · Rating 3 · Confidence 3

Strengths

1) The core idea -- that incorporating global information of the specialization of finetuned expert models into local routing schemes can improve expert aggregation algorithms -- is intuitive and persuasive.
2) The use of an LLM to encode global semantic information of the overall expert specialization is a creative method for effectively integrating the required global context.

Weaknesses

1) My first concern relates to the overall problem setting of the paper and its core motivation of improving performance on held-in tasks. The authors claim that existing MoErging methods often prioritize generalization to unseen tasks at the expense of performance on held-in tasks, and indeed in Table 1 the authors report as one of their main results that GLIDER significantly outperforms baselines on held-in tasks. However, performance on held-in tasks is deemed unimportant precisely because w

Code & Models

Repositories

unites-lab/glider
PyTorch · Official

Videos

Glider: Global and Local Instruction-Driven Expert Router · Underline

Taxonomy

Topics

Modular Robots and Swarm Intelligence · Multi-Agent Systems and Negotiation · Robotics and Automated Systems