TL;DR
AP-BMM introduces an asynchronous Bayesian approach guided by model differences to efficiently generate a diverse set of LLMs balancing reasoning ability and inference cost.
Contribution
It presents a novel asynchronous, prior-guided Bayesian model merging method that improves Pareto-set coverage and GPU utilization in multi-objective LLM merging.
Findings
AP-BMM achieves stronger Pareto-set quality under fixed evaluation budgets.
It covers a broader range of trade-offs than baseline methods.
It reduces wall-clock time through better GPU utilization.
Abstract
Serving Large Language Models (LLMs) often requires choosing between stronger reasoning and lower inference cost. Model merging offers a practical way to build several models between a reasoning-oriented model and a cheaper base model, but common model-level merging methods usually control this trade-off with only one or two global knobs. We study this setting as a multi-objective optimization problem: instead of producing one merged model, the goal is to find a set of merged models that cover different accuracy–token-cost preferences. Layer-wise merging is more flexible because it can assign different merge weights to different Transformer layers. However, it introduces two practical challenges. First, the layer-wise search space is large, and existing methods often search it without using helpful signals from the source models. Second, LLM evaluations can take very different amounts…
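To make the setting concrete, here is a minimal sketch of two ideas the abstract describes: assigning a separate merge weight to each layer, and keeping only the merged models that are not dominated on the accuracy and token-cost objectives. It is an illustration under assumptions, not the paper's method. The merge operator is assumed to be plain linear interpolation, the "models" are toy dictionaries of lists, and the names `merge_layerwise`, `pareto_front`, and all scores are hypothetical. The asynchronous Bayesian search itself is not shown.

```python
from typing import Dict, List, Sequence, Tuple

# Toy stand-in for a model state dict: layer name -> flat list of parameters.
Weights = Dict[str, List[float]]


def merge_layerwise(base: Weights, reasoning: Weights, alphas: Dict[str, float]) -> Weights:
    """Interpolate each layer with its own coefficient (assumed linear merge).

    alpha = 0 keeps the base layer; alpha = 1 takes the reasoning layer.
    """
    merged: Weights = {}
    for name, base_params in base.items():
        a = alphas[name]
        merged[name] = [(1.0 - a) * b + a * r for b, r in zip(base_params, reasoning[name])]
    return merged


def pareto_front(points: Sequence[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated (accuracy, token_cost) points.

    Accuracy is maximized and token cost is minimized.
    """
    front = []
    for i, (acc_i, cost_i) in enumerate(points):
        dominated = any(
            acc_j >= acc_i and cost_j <= cost_i and (acc_j > acc_i or cost_j < cost_i)
            for j, (acc_j, cost_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Two-layer toy models with a different merge weight per layer.
base = {"layer0": [0.0, 2.0], "layer1": [1.0, 1.0]}
reasoning = {"layer0": [4.0, 6.0], "layer1": [3.0, 5.0]}
merged = merge_layerwise(base, reasoning, {"layer0": 0.25, "layer1": 0.5})

# Hypothetical (accuracy, mean generated tokens) scores for four merged models.
candidates = [(0.60, 900.0), (0.72, 1500.0), (0.70, 1800.0), (0.80, 2600.0)]
front = pareto_front(candidates)
```

In this toy run the third candidate is dropped because the second is both more accurate and cheaper; the remaining three form the trade-off set a user would choose from. With one coefficient per layer, the search space grows with model depth, which is the first challenge the abstract raises.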
