TL;DR
This paper introduces PSO-Merging, a data-driven particle swarm optimization method for efficiently merging multiple language models, outperforming existing techniques in scalability and effectiveness.
Contribution
The paper proposes a novel PSO-based model merging approach that overcomes computational challenges of gradient-based methods and improves performance over existing gradient-free techniques.
Findings
PSO-Merging outperforms baseline methods in average performance on language model merging tasks.
Compared with gradient-based approaches, it requires less memory, which makes it more practical for merging large models.
Experimental results demonstrate improved model performance.
Abstract
Model merging has emerged as an efficient strategy for constructing multitask models by integrating the strengths of multiple available expert models, thereby reducing the need to fine-tune a pre-trained model for all the tasks from scratch. Existing data-independent methods struggle with performance limitations due to the lack of data-driven guidance. Data-driven approaches also face key challenges: gradient-based methods are computationally expensive, limiting their practicality for merging large expert models, whereas existing gradient-free methods often fail to achieve satisfactory results within a limited number of optimization steps. To address these limitations, this paper introduces PSO-Merging, a novel data-driven merging method based on the Particle Swarm Optimization (PSO). In this approach, we initialize the particle swarm with a pre-trained model, expert models, and…
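As a rough illustration of the swarm idea in the abstract, the sketch below runs a textbook PSO loop over flat parameter vectors, with the swarm seeded from a "base" vector and two "expert" vectors. It is not the paper's exact update rule: the coefficients `w`, `c1`, `c2`, the toy `fitness` function, and the two-dimensional vectors are all assumptions made for illustration, and the abstract is truncated before it lists every particle used for initialization.

```python
import random

def pso_merge(particles, fitness, steps=5, w=0.5, c1=1.5, c2=1.5, seed=0):
    """Textbook PSO over flat parameter vectors (illustrative, not the paper's rule)."""
    rng = random.Random(seed)
    xs = [list(p) for p in particles]            # current positions
    vs = [[0.0] * len(p) for p in xs]            # velocities start at zero
    pbest = [list(p) for p in xs]                # each particle's best position so far
    pbest_fit = [fitness(p) for p in xs]
    g = max(range(len(xs)), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = list(pbest[g]), pbest_fit[g]   # best position in the swarm

    for _ in range(steps):
        for i, x in enumerate(xs):
            r1, r2 = rng.random(), rng.random()
            for d in range(len(x)):
                # inertia + pull toward personal best + pull toward global best
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - x[d])
                            + c2 * r2 * (gbest[d] - x[d]))
                x[d] += vs[i][d]
            f = fitness(x)                        # gradient-free: only scores are needed
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = list(x), f
                if f > gbest_fit:
                    gbest, gbest_fit = list(x), f
    return gbest, gbest_fit

# Toy stand-ins (hypothetical): a base model and two experts as 2-D vectors.
toy_particles = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

def toy_fitness(x):
    # Higher is better; peaks at (0.6, 0.6), a point between the two "experts".
    return -((x[0] - 0.6) ** 2 + (x[1] - 0.6) ** 2)

best, best_fit = pso_merge(toy_particles, toy_fitness)
```

Because the global best is replaced only when a strictly better score appears, the returned score is never worse than the best initial particle. In a real merging setup the fitness call would be a model evaluation on a small labeled set, which is where most of the cost sits.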
Peer Reviews
Decision: Submitted to ICLR 2026
+ This paper proposes a novel model merging method, PSO-Merging. Experimental results show that PSO-Merging outperforms baseline methods in terms of average performance.
+ Compared with gradient-based approaches, it requires less memory, making it more practical for large-scale model merging.
- Generalization concerns remain. Although PSO-Merging achieves the highest average performance, these improvements are often driven by strong results on only a subset of tasks. In many cases, the method performs worse than some baselines at the per-task level (e.g., compared with DELLA-Merging in Table 1). This raises concerns about whether the approach consistently generalizes across diverse task types.
- Efficiency evaluation is insufficient and lacks clarity. A key motivation of the work i
- Broad experiments across Flan-T5, LLaMA, and Mistral, covering GLUE, MATH, Instruct, and Code, compared with various methods such as Task Arithmetic, DARE, TIES, and AdaMerging.
- The PSO idea is interesting: it treats the entire model as a particle and can wrap around existing experts.
- Can only use small labeled optimization sets: for GLUE, 50 training samples per task; for LLaMA/Mistral, a 1:10 split.
- W1: This method requires access to an external judge and task metrics during the swarm process. With $N$ tasks, the swarm has $2N+1$ particles (experts + sparsified experts + base). With $S$ steps (main results: $S=5$ for LLaMA/Mistral; $S=50$ for Flan-T5), the number of scorings for each metric is $(2N+1) \times S$, which is costly when the judge is a large LLM or the experts are large LLMs.
- W2: As this method requires test metric results, it risks data leakage and is unfair to compare wit
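The evaluation budget described in W1 can be tallied with a few lines of arithmetic. The sketch below uses the reviewer's formula $(2N+1) \times S$ with the step counts quoted in the review; the task count `N = 4` is only an illustrative value, since the excerpt does not state the number of tasks per experiment.

```python
def swarm_size(num_tasks: int) -> int:
    # N experts + N sparsified experts + 1 base model, per the review's description
    return 2 * num_tasks + 1

def total_scorings(num_tasks: int, steps: int) -> int:
    # One scoring per particle per step, for each metric
    return swarm_size(num_tasks) * steps

N = 4  # illustrative task count (assumption, not taken from the paper)
print(total_scorings(N, steps=5))   # S=5, quoted for LLaMA/Mistral  -> 45
print(total_scorings(N, steps=50))  # S=50, quoted for Flan-T5       -> 450
```

Each of these scorings is a full evaluation of one candidate model, so the cost grows linearly in both the number of tasks and the number of steps.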
- The optimization algorithm is well-defined, gradient-free, and demonstrates rapid convergence in practice.
- Extensive experiments across diverse model families and benchmarks, including GLUE and multitask setups, showing consistent improvements over multiple baselines.
- The paper provides ablations on the momentum parameter, number of particles, and sparsification mechanism, supporting the method’s stability.
1. The relationship between PSO-Merging and prior swarm-based approaches (e.g., Model Swarms [7]) is mentioned but not discussed in depth, making it unclear what the methodological novelty is.
2. Training details of the expert models (especially for Llama-3 and Mistral) are not fully disclosed (the optimizer is not specified), limiting the reproducibility of the experiments.
3. The presentation suffers from grammatical issues, redundant explanations, and hard-to-parse notation (lines 16
