TL;DR
ParaX introduces a dynamic parameter routing method with shared expert centers for efficient, input-dependent adaptation of pre-trained vision models, enhancing performance on dense prediction tasks.
Contribution
It proposes ParaX, a novel adapter-style approach using shared expert centers and dynamic routing for improved parameter-efficient fine-tuning.
Findings
ParaX outperforms existing methods on various visual recognition tasks.
Dynamic weight matrices enable low-rank, input-dependent feature adaptation.
Shared expert centers promote cross-layer feature diversity.
Abstract
Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose ParaX, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each ParaX module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in ParaX modules…
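The routing step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The class names `ExpertCenter` and `RoutedAdapter`, the mean-pooled input summary, the dense softmax gate, the separate down/up centers, the ReLU bottleneck, and all shapes are assumptions made for illustration. The abstract only states that experts are trainable parameter matrices held in shared centers and that each module aggregates them into its own weights.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class ExpertCenter:
    """Hypothetical shared bank: num_experts trainable matrices of shape (d_in, d_out)."""

    def __init__(self, num_experts, d_in, d_out, rng):
        self.experts = rng.normal(0.0, 0.02, size=(num_experts, d_in, d_out))


class RoutedAdapter:
    """Hypothetical adapter module that builds its weights from shared centers."""

    def __init__(self, down_center, up_center, rng):
        self.down_center = down_center  # shared across modules
        self.up_center = up_center      # shared across modules
        num_experts, dim, _ = down_center.experts.shape
        # Each module owns only its small routers (assumption).
        self.router_down = rng.normal(0.0, 0.5, size=(dim, num_experts))
        self.router_up = rng.normal(0.0, 0.5, size=(dim, num_experts))

    def route(self, x):
        """Return input-dependent (w_down, w_up, gates) for tokens x of shape (n, dim)."""
        summary = x.mean(axis=0)                      # (dim,) pooled input descriptor
        g_down = softmax(summary @ self.router_down)  # (num_experts,)
        g_up = softmax(summary @ self.router_up)      # (num_experts,)
        # Gate-weighted sum over the expert axis yields one matrix per projection.
        w_down = np.einsum("e,eio->io", g_down, self.down_center.experts)  # (dim, rank)
        w_up = np.einsum("e,eio->io", g_up, self.up_center.experts)        # (rank, dim)
        return w_down, w_up, (g_down, g_up)

    def __call__(self, x):
        w_down, w_up, _ = self.route(x)
        # Low-rank bottleneck with ReLU and a residual connection (assumption).
        return x + np.maximum(x @ w_down, 0.0) @ w_up


rng = np.random.default_rng(0)
dim, rank, num_experts = 32, 4, 8
down_center = ExpertCenter(num_experts, dim, rank, rng)
up_center = ExpertCenter(num_experts, rank, dim, rng)

# Two modules at different depths draw from the same centers.
layer_a = RoutedAdapter(down_center, up_center, rng)
layer_b = RoutedAdapter(down_center, up_center, rng)

x = rng.normal(size=(16, dim))  # 16 tokens of one input
y = layer_b(layer_a(x))
print(y.shape)  # (16, 32)
```

In this sketch the per-module cost is only the two router matrices, while the expert matrices are stored once and shared by every module. A sparse (top-k) or sigmoid gate could replace the softmax without changing the overall structure.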
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. **Clear problem framing with concrete evidence.** The paper motivates representation deficiency (smaller ERFs) and feature redundancy (higher cross-layer CKA) for prior adapters and illustrates the proposed fix via shared expert centers and dynamic routing (Fig. 1–2).
2. **Simple, implementable mechanism.** The expert-center + router design is straightforward; the method section specifies shapes and routing steps, and extends naturally to multi-kernel depthwise convolutions (Fig. 3–4).
3.
# Major Concerns
1. **Routing stability & expert semantics under-analyzed.** While the router composes per-input weights from a shared center, the paper gives limited analysis of expert specialization, collapse/over-use, or routing entropy (beyond softmax vs. sigmoid). It is unclear whether distinct experts truly encode complementary functions or if routing degenerates. Consider reporting expert usage histograms, KL/entropy of gates, and CKA across experts, not only layers.
2. **Compute/memory
- The AdaRoute method is simple yet effective.
- The fine-tuning of vision models on large-scale dense prediction tasks is an important research topic.
My main concern lies in the methodological overlap between this work and LoRand, as well as Mona, and in the lack of a strong motivation. The idea of using MoE to generate low-rank matrices has already been introduced in LoRand, while Mona enhances adapter performance in dense prediction through multi-scale convolutions. The proposed method in this paper appears to be a combination of these two approaches. Although the authors point out the limitations of prior work in terms of effective recept
1. The paper introduces a mixture-of-experts mechanism into the field of PEFT. By employing a routing network to aggregate shared expert matrices, it achieves input-dependent weight generation, enabling better adaptation to downstream tasks. Meanwhile, the shared expert center design implicitly enhances cross-layer feature interaction.
2. The AdaRoute module is lightweight and highly generalizable. It can dynamically generate parameters, effectively reducing redundant feature learning during fin
1. The paper lacks an in-depth theoretical explanation and mathematical analysis of the proposed dynamic parameter routing mechanism. It does not clearly elaborate on the principles, optimization stability, or convergence properties underlying the method, making its theoretical foundation relatively weak.
2. Although the introduction of multi-scale convolutions enhances spatial modeling capability, it also introduces additional computational overhead, leading to training delays. Furthermore, th
1. The proposed router-expert structure is simple yet effective. It avoids the heavy computational cost of standard MoE while introducing input-conditioned flexibility.
2. By sharing a global expert center among AdaRoute modules, the method enables implicit cross-layer communication through jointly updated expert matrices. This design effectively reduces feature redundancy and encourages representation diversity.
3. The paper conducts comprehensive experiments on various vision tasks, including
1. The paper compares AdaRoute with only a limited set of baseline methods. It lacks evaluation against several state-of-the-art PEFT approaches widely used for vision model adaptation, such as MLAE [1], SPT [2], DA-VPT [3], and RepAdapter [4]. Including these baselines would provide a more convincing and comprehensive comparison.
2. Conceptually, the proposed method can be viewed as an adapter-based architecture augmented with a Mixture-of-Experts (MoE) gating mechanism. Although the authors introdu
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
