FuseChat: Knowledge Fusion of Chat Models
Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, Xiaojun Quan

TL;DR
FuseChat introduces a lightweight knowledge fusion framework that combines multiple chat LLMs of different architectures into a more capable model, reducing training costs and enhancing performance on instruction-following tasks.
Contribution
The paper proposes a novel two-stage knowledge fusion method for chat LLMs, including token alignment and parameter space merging, validated on diverse models and benchmarks.
Findings
FuseChat outperforms baseline models of similar size.
FuseChat approaches GPT-3.5-Turbo performance on MT-Bench.
The method is effective across diverse architectures and scales.
Abstract
While training large language models (LLMs) from scratch can indeed lead to models with distinct capabilities and strengths, it incurs substantial costs and may lead to redundancy in competencies. Knowledge fusion aims to integrate existing LLMs of diverse architectures and capabilities into a more potent LLM through lightweight continual training, thereby reducing the need for costly LLM development. In this work, we propose a new framework for the knowledge fusion of chat LLMs through two main stages, resulting in FuseChat. Firstly, we conduct pairwise knowledge fusion on source chat LLMs of varying structures and scales to create multiple target LLMs with identical structure and size via lightweight fine-tuning. During this process, a statistics-based token alignment approach is introduced as the cornerstone for fusing LLMs with different structures. Secondly, we merge these target…
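The second stage described above, merging target LLMs of identical structure in parameter space, can be illustrated with a minimal sketch. This is not the paper's merging rule, which is cut off in the text above. It assumes that every target model was fine-tuned from one shared starting checkpoint (called `pivot` here, a hypothetical name) and it uses a plain uniform average of each target's weight change as a stand-in. The function `merge_targets` and the toy weights are likewise hypothetical.

```python
from typing import Dict, List

# A toy "model": parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def merge_targets(pivot: Weights, targets: List[Weights]) -> Weights:
    """Average the targets' deltas from the pivot and add the result to the pivot.

    Assumption: all targets share the pivot's parameter names and shapes,
    which is what "identical structure and size" makes possible.
    """
    merged: Weights = {}
    for name, base in pivot.items():
        # One delta per target: target weight minus pivot weight.
        deltas = [[t - b for t, b in zip(target[name], base)] for target in targets]
        # Uniform average across targets (a stand-in, not the paper's rule).
        mean_delta = [sum(column) / len(targets) for column in zip(*deltas)]
        merged[name] = [b + d for b, d in zip(base, mean_delta)]
    return merged


# Toy example with one parameter tensor and two target models.
pivot = {"layer.w": [1.0, 2.0]}
targets = [{"layer.w": [2.0, 2.0]}, {"layer.w": [1.0, 4.0]}]
merged = merge_targets(pivot, targets)
```

The sketch only works because the first stage yields models with matching parameter names and shapes; source models with different architectures could not be averaged entry by entry in this way.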
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The paper studies an interesting question: how to fuse multiple chat LLMs into a more potent chat LLM. The paper is well-written and well-organized. 2. The paper has extensive experiments investigating the effectiveness of the proposed framework and of each of its components. 3. Their fusion method is also computation-friendly, as it doesn't require additional training or datasets.
1. They didn't provide a significance test showing whether the proposed method significantly outperforms the baselines (e.g. FuseLLM/OpenChat-3.5-7B Multi). Because the improvement on some tasks is small, it would be better to show whether the improvement is significant. 2. Table 1's caption needs to be improved. It would help to clarify what the bold font and underlining mean in the table and what the percentages represent.
1. The motivation is practical and significant, offering a cost-effective solution for integrating capabilities of different heterogeneous LLMs without training new models from scratch. 2. The two-stage framework effectively combines heterogeneous model knowledge through distillation into homogeneous models followed by parameter merging, with a well-designed token alignment strategy. 3. Comprehensive experiments validate the framework's effectiveness, showing competitive performance against di
1. The paper's technical contribution appears somewhat limited. The approach can be viewed as a combination of pairwise FuseLLM and model merging (similar to TIES-Merging), both of which have been previously established as effective methods. The improved performance, while notable, follows logically from the combination of these known techniques, making the technical innovation less impressive than desired. 2. Several claims in the paper require further clarification. For instance, the statement
In general, the logic of the article is good, and the abstract, main text, and conclusions are consistent. The experiments are sufficiently convincing. The author summarizes the previous work from multiple aspects in the related work section.
1. The Introduction does not sufficiently explain the challenges faced by FuseChat. Explaining the advantages of knowledge fusion is not enough; the complexity of the work should also be highlighted. 2. The Introduction does not state the contributions of the paper. 3. The method section relies too heavily on narrative description and lacks specific formulas, which makes the article harder to understand. 4. In th
Code & Models
- 🤗 yamatazen/Twilight-SCE-12B · model · 8 dl · ♡ 3
- 🤗 Babsie/ThetaBlackGorgon-8B · model · 12 dl · ♡ 8
- 🤗 FuseAI/OpenChat-3.5-7B-InternLM-v2.0 · model · 13 dl · ♡ 1
- 🤗 FuseAI/OpenChat-3.5-7B-Qwen-v2.0 · model · 7 dl
- 🤗 FuseAI/OpenChat-3.5-7B-Starling-v2.0 · model · 11 dl · ♡ 2
- 🤗 FuseAI/FuseChat-7B-v2.0 · model · 16 dl · ♡ 10
- 🤗 FuseAI/OpenChat-3.5-7B-SOLAR-v2.0 · model · 7 dl · ♡ 1
- 🤗 FuseAI/OpenChat-3.5-7B-Mixtral-v2.0 · model · 7 dl
- 🤗 RichardErkhov/FuseAI_-_FuseChat-7B-v2.0-gguf · model · 3 dl
- 🤗 FuseAI/FuseChat-Qwen-2.5-7B-SFT · model · 7 dl · ♡ 2
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
Methods: Cosine Annealing · Weight Decay · Adam · Byte Pair Encoding · Softmax · Dense Connections · Dropout · Linear Layer
