Stay Unique, Stay Efficient: Preserving Model Personality in Multi-Task Merging
Kuangpu Guo, Yuhe Ding, Jian Liang, Zilei Wang, and Ran He

TL;DR
This paper introduces DTS, a novel model merging framework that preserves task-specific information with minimal storage, outperforming existing methods and generalizing well to unseen tasks.
Contribution
DTS is an approximation-based personalized merging method using SVD and thresholding, with a variant for data-free generalization to unseen tasks.
Findings
DTS outperforms state-of-the-art baselines in preserving task performance.
DTS requires only 1% additional storage per task.
The variant of DTS generalizes effectively to unseen tasks.
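To give a feel for how a storage figure on the order of 1% per task can arise, the back-of-the-envelope sketch below counts the bits needed for a rank-`rank` factorization whose singular-vector elements are stored as few-bit group codes. The function name `extra_storage_fraction`, the 2-bit codes, the 32-bit floats, and the 768 x 768 layer shape are illustrative assumptions, not figures from the paper, and the paper's own accounting may differ.

```python
def extra_storage_fraction(m, n, rank, code_bits=2, num_groups=4, float_bits=32):
    """Fraction of a dense m x n float matrix needed by a quantized low-rank copy.

    Assumed layout (hypothetical, for illustration only):
      - rank * (m + n) singular-vector elements, each stored as a code_bits code
      - rank singular values stored as floats
      - num_groups scaling factors for each of the two factor matrices
    """
    full_bits = m * n * float_bits
    code_storage = rank * (m + n) * code_bits
    side_info = (rank + 2 * num_groups) * float_bits
    return (code_storage + side_info) / full_bits


# Hypothetical 768 x 768 layer with 16 retained singular directions.
fraction = extra_storage_fraction(768, 768, rank=16)
print(f"extra storage: {fraction:.4%} of the dense layer")
```

Under these assumptions the overhead grows linearly with the retained rank, so the rank is the main knob trading accuracy against per-task storage.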
Abstract
Model merging has emerged as a promising paradigm for enabling multi-task capabilities without additional training. However, existing methods often experience substantial performance degradation compared with individually fine-tuned models, even on similar tasks, underscoring the need to preserve task-specific information. This paper proposes Decomposition, Thresholding, and Scaling (DTS), an approximation-based personalized merging framework that preserves task-specific information with minimal storage overhead. DTS first applies singular value decomposition to the task-specific information and retains only a small subset of singular values and vectors. It then introduces a novel thresholding strategy that partitions singular vector elements into groups and assigns a scaling factor to each group. To enable generalization to unseen tasks, we further extend DTS with a variant that fuses…
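The three steps named in the abstract (decomposition, thresholding, scaling) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the "task-specific information" is taken to be a weight difference matrix `delta`, the thresholds are quartiles of the singular-vector elements, and each group's scaling factor is the group mean. The helper names `quantize_groups`, `compress`, and `reconstruct` are hypothetical.

```python
import numpy as np


def quantize_groups(mat, num_groups=4):
    """Partition elements into groups by value and keep one scale per group.

    Assumption: thresholds are evenly spaced quantiles and the scale of a
    group is the mean of its members. The paper's actual rule may differ.
    """
    probs = np.linspace(0.0, 1.0, num_groups + 1)[1:-1]
    edges = np.quantile(mat, probs)              # num_groups - 1 thresholds
    codes = np.digitize(mat, edges)              # integers in [0, num_groups - 1]
    scales = np.array([
        mat[codes == g].mean() if np.any(codes == g) else 0.0
        for g in range(num_groups)
    ])
    return codes.astype(np.uint8), scales


def compress(delta, rank, num_groups=4):
    """Truncated SVD of delta, followed by group quantization of U and V^T."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    u, s, vt = u[:, :rank], s[:rank], vt[:rank, :]
    return quantize_groups(u, num_groups), s, quantize_groups(vt, num_groups)


def reconstruct(compressed):
    """Rebuild an approximation of delta from group codes, scales, and singular values."""
    (u_codes, u_scales), s, (vt_codes, vt_scales) = compressed
    u_hat = u_scales[u_codes]                    # look up each element's group scale
    vt_hat = vt_scales[vt_codes]
    return (u_hat * s) @ vt_hat


# Toy example: a synthetic rank-8 "task difference" matrix.
rng = np.random.default_rng(0)
delta = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))
approx = reconstruct(compress(delta, rank=8))
rel_err = np.linalg.norm(approx - delta) / np.linalg.norm(delta)
print(f"relative reconstruction error: {rel_err:.3f}")
```

With four groups, each singular-vector element needs only a 2-bit code plus a shared table of four scales, which is where most of the storage saving in such a scheme would come from.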
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1) Comprehensive empirical validation: The paper conducts large-scale experiments across multiple domains and backbones, including ViT-B/32, RoBERTa, GPT-2, and Qwen-14B (Sec. 4.1). Results consistently show that DTS outperforms both basic and personalized baselines while maintaining very low storage overhead. For example, DTS-D achieves 90.40% average accuracy on vision tasks with only 3.68% extra memory (Table 1), while achieving near-individual model performance on RoBERTa with 0.88% overhead
1) Lack of theoretical grounding for the method: Although DTS shows strong empirical performance, there is no theoretical analysis of its approximation quality or why the proposed thresholding and scaling preserve task identity. The manuscript does not provide error bounds for SVD truncation (Eq. 2) or theoretical justification for using four quantization groups (Sec. 3.2). 2) Inference cost ignored: Reconstructing $\hat{\theta}_u$ at every forward pass (Eq. (7)) adds matrix multiplications whose
1. Offers a better trade-off between performance and storage efficiency than the baselines. 2. The presentation of the method is clear and easy to follow. 3. Introduces the difference vector, which makes the method practical when the pretrained model is inaccessible. 4. The investigation of unseen tasks is appreciated, especially for real-world use.
1. The authors do not evaluate computational overhead at inference time. Basic merging methods produce a single model, while personalized methods like DTS require reconstructing the model for each task. Therefore, DTS should have higher computational overhead than basic merging, and it would be valuable to evaluate this aspect too. 2. Some details are missing (see questions).
- This paper proposes DTS for multi-task model merging. - The method is validated across multiple architectures and domains. - The paper is well-structured, and the method is easy to follow.
- The proposed method restores expert models by adding low-rank and sparse components to the pretrained model. However, this approach seems somewhat redundant: one could directly perform low-rank decomposition or sparsification on the expert model itself and store those parameters, eliminating the need to retain the pretrained model and thus saving storage. What are the advantages of the proposed approach compared with directly saving compact expert models? - In Equation (7), the method only utilize
- The singular-vector-level thresholding is methodologically novel and might be the main strength of the method. The 1% extra storage requirement is impressive, and is obtained by leveraging both low-rank approximation and quantization. - The method is data-free: there is no training, tuning, or test-time adaptation involved. Although such steps are common in the related literature, I strongly advise against using task-specific data, since doing so limits the practical adoption of the method.
- The method requires task information at inference time which is not obtained by routing. This makes the comparison with standard merging techniques unfair, as the model is expecting not only the classification head to be known, but also which task-specific parameters to use. While this assumption is present in some of the literature, it severely restricts the usefulness of the approach, and is often unrealistic in practice. - Lack of motivation. The performance degradation on similar tasks is
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Topic Modeling
