MoD: A Distribution-Based Approach for Merging Large Language Models
Quy-Anh Dang, Chris Ngo

TL;DR
The paper introduces MoD, a novel distribution-based method for merging large language models that preserves their specialization and improves performance over traditional weight-averaging techniques.
Contribution
MoD is a new framework that merges LLMs by operating on output distributions, enhancing knowledge sharing while maintaining model specialization.
Findings
MoD outperforms existing merging methods on mathematical reasoning benchmarks.
It effectively preserves individual model capabilities during merging.
Experiments with Qwen2.5 models show significant improvements over existing merging techniques across multiple benchmarks.
Abstract
Large language models (LLMs) have enabled the development of numerous specialized, task-specific variants. However, the maintenance and deployment of these individual models present substantial challenges in terms of resource utilization and operational efficiency. In this work, we propose the Mixture of Distributions (MoD) framework, a novel approach for merging LLMs that operates directly on their output probability distributions, rather than on model weights. Unlike traditional weight-averaging methods, MoD effectively preserves the specialized capabilities of individual models while enabling efficient knowledge sharing across tasks. Through extensive experimentation on mathematical reasoning benchmarks using Qwen2.5 models, we demonstrate that MoD significantly outperforms existing model merging techniques across multiple benchmarks. All code, data, and experimental…
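The abstract says MoD works on output probability distributions rather than on weights, but this summary does not give the actual mixing rule. As a rough illustration only, the sketch below assumes the simplest possible form: a fixed convex combination of two models' next-token distributions over a shared vocabulary. The function names, the `weight_a` parameter, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_distributions(logits_a, logits_b, weight_a=0.5):
    """Illustrative mixture of two output distributions.

    Assumption (not from the paper): the merged distribution is
    weight_a * p_a + (1 - weight_a) * p_b, with both models sharing
    the same vocabulary ordering.
    """
    if len(logits_a) != len(logits_b):
        raise ValueError("both models must score the same vocabulary")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    p_a = softmax(logits_a)
    p_b = softmax(logits_b)
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(p_a, p_b)]


# Toy next-token logits over a 3-token vocabulary (made-up numbers).
logits_math = [2.0, 0.5, -1.0]      # hypothetical math-specialized model
logits_general = [0.1, 1.5, 0.3]    # hypothetical general-purpose model

mixed = mix_distributions(logits_math, logits_general, weight_a=0.7)
```

Because each input is a valid distribution and the weights sum to one, the mixture is again a valid distribution. By contrast, averaging weights gives no such guarantee about how the merged model's outputs relate to those of its parents.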
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
