Knowledge Fusion of Large Language Models Via Modular SkillPacks

Guodong Du; Zhuo Li; Xuanning Zhou; Junlin Li; Zesheng Shi; Wanyu Lin; Ho-Kin Tang; Xiucheng Li; Fangming Liu; Wenya Wang; Min Zhang; Jing Li

arXiv:2505.18502·cs.AI·February 27, 2026

Open Access · 1 Repo · 3 Reviews

TL;DR

GraftLLM introduces SkillPacks, a modular knowledge transfer method for large heterogeneous language models, enabling efficient, scalable, and forget-free multi-capability fusion and continual learning.

Contribution

It proposes SkillPacks and a module-aware compression strategy for effective, scalable knowledge transfer and fusion in large heterogeneous LLMs, overcoming limitations of existing methods.

Findings

01 · Outperforms existing techniques in knowledge transfer and fusion

02 · Supports forget-free continual learning

03 · Efficiently compresses knowledge for scalable deployment

Abstract

Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning. Recent works like FuseLLM and FuseChat have demonstrated the potential of transferring multiple model capabilities to lightweight models, enhancing adaptability and efficiency, which motivates our investigation into more efficient cross-capability transfer methods. However, existing approaches primarily focus on small, homogeneous models, limiting their applicability. For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model's intrinsic capacity and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs. To address these issues, we introduce GraftLLM, a novel method that stores source model…
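The abstract is cut off at this point. Going only by the pipeline as one reviewer summarizes it (distill, calculate a delta, compress it, compose the modules), a SkillPack can be pictured as a compressed parameter delta that is added back onto an unchanged base model. The sketch below illustrates that idea under loud assumptions: truncated SVD stands in for whatever compression the paper actually uses, a per-module rank table stands in for the "module-aware" strategy, and the names `make_skillpack`, `graft`, `attn`, and `mlp` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def make_skillpack(base, tuned, ranks):
    """Compress each module's delta (tuned - base) with a truncated SVD.

    ASSUMPTION: low-rank truncation with a per-module rank is only a
    stand-in for the paper's module-aware compression.
    """
    pack = {}
    for name, w_base in base.items():
        delta = tuned[name] - w_base
        u, s, vt = np.linalg.svd(delta, full_matrices=False)
        r = ranks[name]
        # Store two thin factors instead of the full delta matrix.
        pack[name] = (u[:, :r] * s[:r], vt[:r, :])
    return pack


def graft(base, packs):
    """Return new weights = base + sum of decompressed deltas.

    The base dictionary is copied, never modified, so removing a pack
    restores the original behaviour exactly.
    """
    out = {name: w.copy() for name, w in base.items()}
    for pack in packs:
        for name, (left, right) in pack.items():
            out[name] += left @ right
    return out


# Toy example with two hypothetical modules.
rng = np.random.default_rng(0)
base = {"attn": rng.normal(size=(8, 8)), "mlp": rng.normal(size=(8, 16))}
base_snapshot = {name: w.copy() for name, w in base.items()}

# Pretend a distilled/fine-tuned model differs from base by low-rank deltas.
tuned = {
    "attn": base["attn"] + rng.normal(size=(8, 1)) @ rng.normal(size=(1, 8)),
    "mlp": base["mlp"] + rng.normal(size=(8, 3)) @ rng.normal(size=(3, 16)),
}

# Hypothetical per-module budget: a smaller rank for attention than for MLP.
ranks = {"attn": 1, "mlp": 3}
pack = make_skillpack(base, tuned, ranks)
grafted = graft(base, [pack])
```

In this toy setting the deltas are exactly low-rank, so grafting the pack recovers the tuned weights while `base` stays untouched; real deltas would only be approximated, and the approximation quality would depend on the chosen ranks.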

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- I think the authors did a very good job of identifying a key gap: the lack of a method that both effectively absorbs deep knowledge from a source model (like distillation) and preserves the target model's inherent capabilities (which distillation often fails to do).
- The method introduced is sound and the reframing of model fusion as a modular composition problem is quite novel.
- A "forget-free" method for adding new, complex capabilities to a base model is highly sought after. This approac

Weaknesses

- I think the baseline comparisons are a bit ambiguous. The abstract positions the work against distillation and PEFT methods, and mentions FuseLLM/FuseChat for small models. However, it's unclear whether GraftLLM is benchmarked against current SOTA model merging techniques for large models. These methods also aim to fuse capabilities and are a crucial point of comparison.
- The framework's practicality hinges on the cost of creating the SkillPacks. Can the authors please quantify the overhead of their meth

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper tackles the highly relevant and challenging problem of fusing knowledge across heterogeneous LLMs. The proposed "module-aware adaptive compression" strategy is an intuitive and empirically effective contribution for creating compact, transferable knowledge modules. The experimental results consistently demonstrate strong performance across multiple benchmarks, outperforming several baselines in knowledge fusion.

Weaknesses

The technical contribution of this work appears to be incremental. The proposed pipeline—distilling knowledge, calculating a delta, compressing it, and then composing these modules—is conceptually very similar to existing frameworks like FuseChat (pairwise distillation followed by merging) and LoraHub (dynamic composition of LoRA modules). The main novelty seems to lie in the "module-aware adaptive compression" strategy, but the overall framework feels like a combination of established technique

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The paper addresses a well known problem when developing or adapting LLMs. Instead of developing a completely new model, the goal is to merge existing models. The paper proposes a new approach which preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning.

Weaknesses

The main issue with the paper is that it is not self-contained, which makes it difficult to review. The paper uses several key technical terms such as “cross-capability transfer”, “SkillPack”, MLP, etc. without defining them, which can make the paper very difficult for newcomers to understand. The problem addressed by the paper is not clear, and there is confusion about the paper's contribution:
- The paper is addressing the problem when models are structurally identical
- What is the difference between heter

Code & Models

Repositories

duguodong7/graftllm
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Focus · Knowledge Distillation