FusionFactory: Fusing LLM Capabilities with Multi-LLM Log Data
Tao Feng, Haozhen Zhang, Zijie Lei, Pengrui Han, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jiaxuan You

TL;DR
This paper introduces FusionFactory, a framework that leverages multi-LLM log data to effectively fuse diverse large language models, improving performance across multiple tasks and domains.
Contribution
It presents a novel systematic framework for multi-level LLM fusion and a large-scale benchmark to evaluate fusion strategies, addressing real-world deployment needs.
Findings
FusionFactory outperforms individual LLMs across all benchmarks.
Different fusion configurations are optimal for different tasks.
The approach demonstrates the practical potential of multi-LLM log data for model fusion.
Abstract
The rapid advancement of large language models (LLMs) has created a diverse landscape of models, each excelling at different tasks. This diversity drives researchers to employ multiple LLMs in practice, leaving behind valuable multi-LLM log data. This naturally leads to the question of whether such logs can be fully leveraged to fuse LLMs' complementary capabilities. Although prior work has explored various strategies for integrating multiple LLMs, we argue that practical fusion must meet two essential requirements: (1) compatibility with real-world serving scenarios (e.g., local and API-based serving), and (2) flexibility to operate at different stages of the LLM pipeline to meet varied user needs (e.g., fine-tuning and inference stages). To this end, we introduce LLMFusionBench, a large-scale benchmark for LLM fusion that spans 14 tasks across five domains, with responses from 20…
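As a rough illustration of what fusing models from multi-LLM log data can mean at the simplest level, the sketch below builds a per-task router from logged records. It is not the paper's method. The record fields (`task`, `model`, `performance`, `cost`), the function `build_router`, and the `cost_weight` trade-off are all hypothetical names chosen for this example, and routing by task rather than by individual query is a deliberate simplification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LogRecord:
    """One logged response. All field names are hypothetical."""
    task: str
    model: str
    performance: float  # task metric for this response, higher is better
    cost: float         # cost of the call, arbitrary units


def build_router(logs, cost_weight=0.0):
    """Map each task to the model with the best mean utility in the logs.

    Utility is assumed to be performance - cost_weight * cost; this
    trade-off is an illustrative choice, not taken from the paper.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (task, model) -> [sum, count]
    for record in logs:
        entry = totals[(record.task, record.model)]
        entry[0] += record.performance - cost_weight * record.cost
        entry[1] += 1

    best = {}  # task -> (model, mean utility)
    for (task, model), (total, count) in totals.items():
        mean = total / count
        if task not in best or mean > best[task][1]:
            best[task] = (model, mean)
    return {task: model for task, (model, _) in best.items()}


# Toy log data, invented for the example.
logs = [
    LogRecord("math", "model_a", 0.9, 5.0),
    LogRecord("math", "model_b", 0.7, 1.0),
    LogRecord("code", "model_a", 0.4, 5.0),
    LogRecord("code", "model_b", 0.8, 1.0),
]
router = build_router(logs)
```

With `cost_weight=0.0` the router picks the most accurate model per task; raising `cost_weight` shifts choices toward cheaper models.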
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper introduces LLMFusionBench, a large-scale benchmark that compiles and standardizes multi-LLM log data—responses from diverse language models across multiple tasks and domains—to facilitate systematic studies of model capability fusion.
1. The paper offers little to no methodological novelty: it mainly consolidates previously established techniques, such as routing, reasoning retrieval, and distillation, into a benchmark framework, and most of the findings reported from the proposed setup are already well known in the existing literature, e.g., any "fusion method" is better than a vanilla LLM. 2. What are the direct strengths of the curated dataset here over previous datasets that also generate data from LLM for distillatio
1. Query-level fusion, Thought-level fusion, and Model-level fusion for LLMs are important. 2. Benchmark datasets are provided. 3. Experiments show the performance on different fusion levels.
1. The three fusion levels are related, but only loosely. Each fusion level already has several benchmark papers. The framework might suit industry as an all-in-one pipeline for fine-tuning a model, but the research contribution may be limited. 2. The benchmark at each level is relatively simple and lacks in-depth analysis. 3. In model-level fusion, the fine-tuned model performs worse than the zero-shot model.
(1) The work is motivated by the widespread practice of using multiple LLMs in real systems (e.g., API platforms, agentic workflows), which naturally generates valuable multi-LLM log data, making the research question highly relevant and actionable. (2) It introduces LLMFusionBench, a comprehensive and publicly valuable resource. (3) The framework is designed to work in both local (weights accessible) and API-based (black-box) serving scenarios, addressing a critical gap in prior fusion
(1) The benchmark is constructed by actively querying 20 open-source LLMs with fixed prompts, rather than using real-world operational logs from actual multi-LLM deployments (e.g., user-facing API platforms). This synthetic setup may not reflect true usage patterns, query distributions, or failure modes seen in practice. (2) The paper introduces an LLM judge to score “insightfulness,” but this introduces potential circularity: the same type of model used in fusion is also used to evaluate it. Mor
1. It introduces LLMFusionBench, a large-scale, diverse, and well-structured benchmark covering 14 tasks across 6 domains, responses from 20 LLMs (8B-671B), and including critical metadata like performance, cost, and LLM Judge scores. 2. FusionFactory is an innovative and systematic framework that comprehensively explores fusion at three distinct stages—Query-level (Early), Thought-level (Mid), and Model-level (Late). This stage-aware design satisfies the requirement for practical flexibility an
1. For model-level fusion, the analysis should include more robust distillation or merging methods (e.g., parameter merging or logit distillation for open-source models) to truly demonstrate the limit of model-level fusion using the logs, rather than just the limit of the chosen SFT strategy. 2. While the results claim Query-level fusion has minimal computational overhead, there is no dedicated, quantitative comparison of the latency or API cost of the three FusionFactory levels when deployed in
Taxonomy
Topics: Natural Language Processing Techniques
