Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing

Meng Lou; Stanley Yu; Yizhou Yu

arXiv:2602.06862·cs.CV·May 21, 2026

TL;DR

ParaX introduces a dynamic parameter routing method with shared expert centers for efficient, input-dependent adaptation of pre-trained vision models, enhancing performance on dense prediction tasks.

Contribution

The paper proposes ParaX, an adapter-style method for parameter-efficient fine-tuning that combines shared expert centers with dynamic parameter routing.

Findings

1. ParaX outperforms existing methods on various visual recognition tasks.
2. Dynamic weight matrices enable low-rank, input-dependent feature adaptation.
3. Shared expert centers promote cross-layer feature diversity.

Abstract

Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose ParaX, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each ParaX module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in ParaX modules…
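The routing idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class names `ExpertCenter` and `RoutedAdapter`, the linear router on mean-pooled tokens, the dense softmax gating, the ReLU bottleneck, and the residual connection are all assumptions made for illustration. The abstract only states that each module aggregates parameter matrices from a shared expert center to form input-dependent weights.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class ExpertCenter:
    """Hypothetical shared pool of experts; each expert is one parameter matrix."""

    def __init__(self, num_experts, in_dim, out_dim, rng):
        # Shape: (num_experts, in_dim, out_dim). Trainable in a real setup.
        self.experts = rng.normal(0.0, 0.02, size=(num_experts, in_dim, out_dim))


class RoutedAdapter:
    """Hypothetical adapter whose projections are mixed from shared experts."""

    def __init__(self, dim, down_center, up_center, rng):
        # The centers are shared objects; several modules can hold the same ones.
        self.down_center = down_center
        self.up_center = up_center
        # Assumption: each module owns a small linear router per projection.
        self.router_down = rng.normal(0.0, 0.1, size=(dim, down_center.experts.shape[0]))
        self.router_up = rng.normal(0.0, 0.1, size=(dim, up_center.experts.shape[0]))

    def generate_weights(self, x):
        """Build per-sample weight matrices from x of shape (batch, tokens, dim)."""
        pooled = x.mean(axis=1)  # assumption: route on mean-pooled tokens
        gate_down = softmax(pooled @ self.router_down)  # (batch, num_experts)
        gate_up = softmax(pooled @ self.router_up)
        # Gate-weighted sum over experts gives one matrix per input sample.
        w_down = np.einsum("be,eio->bio", gate_down, self.down_center.experts)
        w_up = np.einsum("be,eio->bio", gate_up, self.up_center.experts)
        return w_down, w_up

    def __call__(self, x):
        w_down, w_up = self.generate_weights(x)
        # Low-rank bottleneck: down-project, ReLU, up-project, residual add.
        hidden = np.maximum(np.einsum("btd,bdr->btr", x, w_down), 0.0)
        return x + np.einsum("btr,brd->btd", hidden, w_up)
```

In this sketch, two `RoutedAdapter` instances built from the same `ExpertCenter` objects update the same expert matrices during training, which is one plausible reading of how shared centers couple layers. A sparse (for example top-k) or sigmoid gate could replace the dense softmax; the choice here is only for brevity.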

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. **Clear problem framing with concrete evidence.** The paper motivates representation deficiency (smaller ERFs) and feature redundancy (higher cross-layer CKA) for prior adapters and illustrates the proposed fix via shared expert centers and dynamic routing (Fig. 1–2).
2. **Simple, implementable mechanism.** The expert-center + router design is straightforward; the method section specifies shapes and routing steps, and extends naturally to multi-kernel depthwise convolutions (Fig. 3–4).
3.

Weaknesses

# Major Concerns

1. **Routing stability & expert semantics under-analyzed.** While the router composes per-input weights from a shared center, the paper gives limited analysis of expert specialization, collapse/over-use, or routing entropy (beyond softmax vs. sigmoid). It is unclear whether distinct experts truly encode complementary functions or if routing degenerates. Consider reporting expert usage histograms, KL/entropy of gates, and CKA across experts, not only layers.
2. **Compute/memory

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- The AdaRoute method is simple yet effective.
- The fine-tuning of vision models on large-scale dense prediction tasks is an important research topic.

Weaknesses

My main concern lies in the methodological overlap between this work and LoRand, as well as Mona, and in the lack of a strong motivation. The idea of using MoE to generate low-rank matrices has already been introduced in LoRand, while Mona enhances adapter performance in dense prediction through multi-scale convolutions. The proposed method in this paper appears to be a combination of these two approaches. Although the authors point out the limitations of prior work in terms of effective recept

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. The paper introduces a mixture-of-experts mechanism into the field of PEFT. By employing a routing network to aggregate shared expert matrices, it achieves input-dependent weight generation, enabling better adaptation to downstream tasks. Meanwhile, the shared expert center design implicitly enhances cross-layer feature interaction.
2. The AdaRoute module is lightweight and highly generalizable. It can dynamically generate parameters, effectively reducing redundant feature learning during fin

Weaknesses

1. The paper lacks an in-depth theoretical explanation and mathematical analysis of the proposed dynamic parameter routing mechanism. It does not clearly elaborate on the principles, optimization stability, or convergence properties underlying the method, making its theoretical foundation relatively weak.
2. Although the introduction of multi-scale convolutions enhances spatial modeling capability, it also introduces additional computational overhead, leading to training delays. Furthermore, th

Reviewer 04 · Rating 4 · Confidence 4

Strengths

1. The proposed router-expert structure is simple yet effective. It avoids the heavy computational cost of standard MoE while introducing input-conditioned flexibility.
2. By sharing a global expert center among AdaRoute modules, the method enables implicit cross-layer communication through jointly updated expert matrices. This design effectively reduces feature redundancy and encourages representation diversity.
3. The paper conducts comprehensive experiments on various vision tasks, including

Weaknesses

1. The paper compares AdaRoute with only a limited set of baseline methods. It lacks evaluation against several state-of-the-art PEFT approaches widely used for vision model adaptation, such as MLAE [1], SPT [2], DA-VPT [3], RepAdapter [4]. Including these baselines would provide a more convincing and comprehensive comparison.
2. Conceptually, the proposed method can be viewed as an adapter-based architecture augmented with a Mixture-of-Experts (MoE) gating mechanism. Although the authors introdu

Code & Models

Repositories

LMMMEng/ParaX (GitHub)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis