Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

Shalini Maiti; Amar Budhiraja; Bhavul Gauri; Gaurav Chaurasia; Anton Protopopov; Alexis Audran-Reiss; Michael Slater; Despoina Magka; Tatiana Shavrina; Roberta Raileanu; Yoram Bachrach

arXiv:2511.13254·cs.CL·November 18, 2025



TL;DR

This paper introduces SoCE, a novel model souping method that uses benchmark-based clustering and weighted averaging of models to significantly improve large language model performance across various domains.

Contribution

The paper presents a new principled approach for model souping that identifies category-specific experts and applies optimized weighted averaging, outperforming previous uniform-averaging methods.
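To make the idea of category-specific experts concrete, here is a minimal sketch. It is not the paper's procedure: the score table, the model and category names, and the "pick the top scorer per category" rule are all hypothetical, chosen only to illustrate how weakly correlated benchmark categories can point to different experts.

```python
import math

# Hypothetical per-category benchmark scores: scores[model][category].
# All names and numbers are illustrative, not results from the paper.
scores = {
    "model_a": {"math": 0.62, "tool_calling": 0.40, "multilingual": 0.55},
    "model_b": {"math": 0.48, "tool_calling": 0.71, "multilingual": 0.50},
    "model_c": {"math": 0.51, "tool_calling": 0.45, "multilingual": 0.68},
}


def category_correlation(scores, cat_x, cat_y):
    """Pearson correlation between two categories, computed across models."""
    models = sorted(scores)
    xs = [scores[m][cat_x] for m in models]
    ys = [scores[m][cat_y] for m in models]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pick_experts(scores):
    """Assumed selection rule: the expert for a category is its top scorer."""
    categories = sorted(next(iter(scores.values())))
    return {c: max(scores, key=lambda m: scores[m][c]) for c in categories}


experts = pick_experts(scores)
math_vs_tools = category_correlation(scores, "math", "tool_calling")
```

In this toy table each category has a different top scorer, and the math and tool-calling scores are negatively correlated across models, which is the kind of situation where averaging several experts is more plausible than relying on a single best model.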

Findings

01. Achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.

02. Improves robustness and performance across multilingual, tool-calling, and math tasks.

03. Demonstrates the effectiveness of non-uniform weighted averaging in model souping.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but training them remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping, the practice of averaging weights from multiple models of the same architecture, has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that uses benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. In contrast to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies "expert" models for…
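The arithmetic behind souping can be sketched in a few lines. The example below is an illustration under stated assumptions, not the SoCE implementation: checkpoints are represented as plain dictionaries mapping parameter names to flat lists of floats, the function name `soup` is hypothetical, and the weights are hand-picked rather than optimized. Uniform averaging is the special case where all weights are equal.

```python
def soup(checkpoints, weights):
    """Weighted average of same-architecture checkpoints.

    checkpoints: list of dicts, parameter name -> flat list of floats
                 (a stand-in for real tensors; this layout is an assumption).
    weights:     one non-negative number per checkpoint; normalized to sum to 1.
    """
    if len(checkpoints) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]

    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        # Souping only makes sense for identical parameter layouts.
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameters")

    souped = {}
    for name in names:
        size = len(checkpoints[0][name])
        souped[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(norm, checkpoints))
            for i in range(size)
        ]
    return souped


# Two toy "models" with a single two-element parameter each.
expert_1 = {"layer.weight": [1.0, 2.0]}
expert_2 = {"layer.weight": [3.0, 4.0]}

uniform = soup([expert_1, expert_2], [1.0, 1.0])       # equal weights
weighted = soup([expert_1, expert_2], [0.75, 0.25])    # favors expert_1
```

With real models the same loop would run over framework tensors instead of lists, and the weights would be chosen by some search over held-out benchmark scores; both of those details are outside what the abstract excerpt specifies.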


Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Natural Language Processing Techniques