Knowledge Fusion of Large Language Models Via Modular SkillPacks
Guodong Du, Zhuo Li, Xuanning Zhou, Junlin Li, Zesheng Shi, Wanyu Lin, Ho-Kin Tang, Xiucheng Li, Fangming Liu, Wenya Wang, Min Zhang, Jing Li

TL;DR
GraftLLM introduces SkillPacks, a modular knowledge transfer method for large heterogeneous language models, enabling efficient, scalable, and forget-free multi-capability fusion and continual learning.
Contribution
It proposes SkillPacks and a module-aware compression strategy for effective, scalable knowledge transfer and fusion in large heterogeneous LLMs, overcoming limitations of existing methods.
Findings
Outperforms existing techniques in knowledge transfer and fusion
Supports forget-free continual learning
Efficiently compresses knowledge for scalable deployment
Abstract
Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning. Recent works like FuseLLM and FuseChat have demonstrated the potential of transferring multiple model capabilities to lightweight models, enhancing adaptability and efficiency, which motivates our investigation into more efficient cross-capability transfer methods. However, existing approaches primarily focus on small, homogeneous models, limiting their applicability. For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model's intrinsic capacity and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs. To address these issues, we introduce GraftLLM, a novel method that stores source model…
Peer Reviews
Decision: ICLR 2026 Poster
- I think the authors did a very good job of identifying a key gap: the lack of a method that both effectively absorbs deep knowledge from a source model (like distillation) and preserves the target model's inherent capabilities (which distillation often fails to do).
- The method introduced is sound and the reframing of model fusion as a modular composition problem is quite novel.
- A "forget-free" method for adding new, complex capabilities to a base model is highly sought after. This approac
- I think the baseline comparisons are a bit ambiguous. The abstract positions the work against distillation and PEFT methods, and mentions FuseLLM/FuseChat for small models. However, it's unclear whether GraftLLM is benchmarked against current SOTA model merging techniques for large models. These methods also aim to fuse capabilities and are a crucial point of comparison.
- The framework's practicality hinges on the cost of creating the SkillPacks. Can the authors please quantify the overhead of their meth
The paper tackles the highly relevant and challenging problem of fusing knowledge across heterogeneous LLMs. The proposed "module-aware adaptive compression" strategy is an intuitive and empirically effective contribution for creating compact, transferable knowledge modules. The experimental results consistently demonstrate strong performance across multiple benchmarks, outperforming several baselines in knowledge fusion.
The technical contribution of this work appears to be incremental. The proposed pipeline—distilling knowledge, calculating a delta, compressing it, and then composing these modules—is conceptually very similar to existing frameworks like FuseChat (pairwise distillation followed by merging) and LoraHub (dynamic composition of LoRA modules). The main novelty seems to lie in the "module-aware adaptive compression" strategy, but the overall framework feels like a combination of established technique
The paper addresses a well-known problem in developing or adapting LLMs. Instead of developing a completely new model, the goal is to merge existing models. The paper proposes a new approach that preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning.
The main issue is that the paper is not self-contained, which makes it difficult to review. The paper uses several key technical terms such as “cross-capability transfer”, “SkillPack”, and MLP without defining them, which can make it very difficult for newcomers to understand. The problem addressed by the paper is not clear, and the paper's contribution is confusing:
- The paper addresses the problem when models are structurally identical
- What is the difference between heter
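One of the reviews above summarizes the pipeline as distilling knowledge, calculating a delta, compressing it, and then composing the resulting modules. The last three steps can be sketched on toy matrices as below. This is only an illustration under stated assumptions, not the authors' implementation: truncated SVD stands in for the paper's module-aware compression, the distillation step is skipped (the tuned weights are assumed to be given), and the names `make_skillpack`, `apply_skillpacks`, and the per-module rank budget are hypothetical.

```python
import numpy as np


def make_skillpack(base, tuned, rank_by_module):
    """Store each module's weight delta (tuned - base) as low-rank factors.

    Assumption: truncated SVD is a stand-in for the paper's compression.
    """
    pack = {}
    for name, rank in rank_by_module.items():
        delta = tuned[name] - base[name]
        u, s, vt = np.linalg.svd(delta, full_matrices=False)
        # Keep the top-`rank` directions; fold singular values into the left factor.
        pack[name] = (u[:, :rank] * s[:rank], vt[:rank, :])
    return pack


def apply_skillpacks(base, packs):
    """Return base weights plus each pack's reconstructed delta.

    `base` itself is never mutated, so the original model stays intact.
    """
    merged = {name: w.copy() for name, w in base.items()}
    for pack in packs:
        for name, (left, right) in pack.items():
            merged[name] += left @ right
    return merged


# Toy demo: two "modules" whose fine-tuning deltas are exactly rank 2.
rng = np.random.default_rng(0)
base = {
    "attn.q_proj": rng.normal(size=(8, 8)),
    "mlp.up_proj": rng.normal(size=(8, 16)),
}
tuned = {
    name: w + rng.normal(size=(w.shape[0], 2)) @ rng.normal(size=(2, w.shape[1]))
    for name, w in base.items()
}

# Hypothetical per-module budget: a different rank for each module type.
ranks = {"attn.q_proj": 2, "mlp.up_proj": 1}
pack = make_skillpack(base, tuned, ranks)
grafted = apply_skillpacks(base, [pack])
```

Because the base weights are copied rather than overwritten, removing a pack restores the original model exactly, which is one simple way to read the "forget-free" claim. Giving each module its own rank is the only sense in which this sketch is "module-aware"; how the paper actually allocates its compression budget is not specified in the text above.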
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Focus · Knowledge Distillation
