TL;DR
MIN-Merging is a router-based framework that selectively merges important neurons in deep learning models, reducing parameter conflicts and improving in-domain performance while maintaining out-of-domain generalization.
Contribution
It introduces a novel neuron selection method for model merging that effectively mitigates parameter conflicts, enhancing performance across diverse tasks.
Findings
Achieves consistent in-domain performance gains
Retains generalization on out-of-domain tasks
Effective across CV and NLP benchmarks
Abstract
Recent advances in deep learning have led to a surge of open-source models across diverse domains. While model merging offers a promising way to combine their strengths, existing approaches often suffer from parameter conflicts that degrade performance on domain-specific tasks. We propose MIN-Merging, a router-based framework that selectively merges the most important neurons to reduce such conflicts. Extensive experiments on Computer Vision (CV) and Natural Language Processing (NLP) benchmarks show that MIN-Merging achieves consistent gains on in-domain tasks while retaining the generalization ability of pretrained models on out-of-domain tasks. These results highlight its effectiveness as a practical solution to the parameter conflict problem in model merging.
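To make the idea of router-weighted, selective merging concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The function names (`top_k_weights`, `keep_important`, `merge`), the use of absolute magnitude as the "importance" score, and the hand-supplied router logits are all assumptions made for illustration; the abstract does not specify how importance is measured or how the router is trained.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_weights(router_logits, k):
    """Keep the k largest logits, softmax over them, and give the rest weight 0."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    keep = order[:k]
    probs = softmax([router_logits[i] for i in keep])
    weights = [0.0] * len(router_logits)
    for i, p in zip(keep, probs):
        weights[i] = p
    return weights


def keep_important(delta, keep_ratio):
    """Zero all but the largest-magnitude entries of one task delta.

    Assumption: magnitude stands in for whatever importance score
    the paper actually uses.
    """
    n_keep = max(1, round(len(delta) * keep_ratio))
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    kept = set(ranked[:n_keep])
    return [d if i in kept else 0.0 for i, d in enumerate(delta)]


def merge(base, deltas, router_logits, k=2, keep_ratio=0.5):
    """Add a router-weighted sum of sparsified task deltas to the base weights."""
    weights = top_k_weights(router_logits, k)
    sparse = [keep_important(d, keep_ratio) for d in deltas]
    return [b + sum(w * s[j] for w, s in zip(weights, sparse))
            for j, b in enumerate(base)]


if __name__ == "__main__":
    base = [1.0, 1.0, 1.0, 1.0]            # flattened pretrained weights (toy)
    deltas = [                              # one fine-tuning delta per task (toy)
        [0.5, -0.01, 0.0, 0.2],
        [0.0, 0.3, -0.4, 0.02],
        [0.1, 0.1, 0.1, 0.1],
    ]
    router_logits = [2.0, 2.0, -5.0]        # hypothetical router output for one input
    print(merge(base, deltas, router_logits))
```

In this toy run the third task receives zero weight from the top-2 router, and small entries such as `-0.01` are dropped before merging, so they cannot conflict with the other task's larger update at the same position.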
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper is well-structured and clearly written. The motivation for solving parameter conflicts is well-established, and the proposed three-stage solution (Enhancement, Routing, Merging) follows a logical and intuitive progression, making the overall argument easy to follow.
2. The paper adopts an intuitive and relatively simple approach to solve the foreseeable problem of parameter conflicts. The description of the router-based dynamic merging mechanism is clear, and the high-level idea of
1. The authors claim that the method solves the problem of task conflict, which seems overclaimed. From my point of view, the method only mitigates the problem through layer-wise reduction rather than solving it.
2. Figure 3 demonstrates that the fine-tuned model can sometimes perform better with certain layers dropped. This is a very interesting observation and needs further explanation.
3. Some minor mistakes should be corrected. For example, in Figure 1, two images are identical
1. **Novel framing of neuron-level merging**. The paper frames a new approach that dynamically merges fine-tuned adapters by emphasizing neuron importance and task-aware routing. While not conceptually deep, the integration of pruning, routing, and dynamic weighting is novel within the LoRA merging context.
2. **Broad and strong empirical evaluation**. The experiments span both NLP (GLUE, MMLU) and CV (ViT-based) tasks, including small and large model scales. The results consistently show improv
1. **Misleading framing as "model merging"**. The method does not produce a single unified model; it keeps all LoRA adapters and linearly combines them at inference. This is conceptually closer to adapter ensembling than to true model merging. The paper should make this distinction explicit and mention that the method benefits from the small size of LoRA adapters. Because all expert adapters are retained, memory usage and inference cost scale with the number of tasks. This contradicts the claim of i
- The paper’s operational combination of (a) neuron/layer pruning to sharpen per‑task experts and (b) a top‑k router to drive input‑conditional merging is a tidy assembly of known ideas. The hierarchical “core vs. redundant” routing within merging is a mildly novel twist.
- Cross‑domain scope (NLP + CV) and an attempt at ablations (removing filtering / hierarchical / router) show some attention to component contribution.
- If credible and reproducible, an input‑conditional merging recipe that’s
- The paper never gives a concrete, reproducible criterion for selecting “core” vs. “redundant” layers/neurons beyond descriptive language and a hand‑wavy SNR analogy.
- Equations (4–5, 12) invoke SNR, mutual information, and entropy but there is no concrete estimation procedure or empirical SNR plots.
- "Performance" values >100% appear in Tables 10–18; if they refer to accuracies, this is of course impossible. Please clarify this issue.
- Competing methods (TIES‑Merging, Twin‑Merging, DARE, Ad
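One review above points out that, because every LoRA adapter is retained, memory grows with the number of tasks. The rough count below illustrates that scaling. It relies only on the standard LoRA shape (a rank-r pair A of size r × d_in and B of size d_out × r); the layer sizes, rank, layer count, and task counts are hypothetical values chosen for illustration, not figures from the paper.

```python
def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def retained_adapter_params(num_tasks, num_layers, d_in, d_out, rank):
    """Total adapter parameters kept when every task's adapter is retained."""
    return num_tasks * num_layers * lora_params(d_in, d_out, rank)


if __name__ == "__main__":
    # Hypothetical configuration: 32 adapted layers of size 4096 x 4096, rank 8.
    d, rank, layers = 4096, 8, 32
    for tasks in (1, 8, 16):
        total = retained_adapter_params(tasks, layers, d, d, rank)
        print(tasks, "tasks:", total, "adapter parameters")
    print("one dense 4096 x 4096 matrix:", d * d, "parameters")
```

Under these assumed sizes, eight retained adapters add up to 16,777,216 parameters, the same as a single dense 4096 × 4096 weight matrix. The overhead is small relative to a full model but grows linearly with the number of tasks, which is the reviewer's point.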
