Wisdom and Delusion of LLM Ensembles for Code Generation and Repair

Fernando Vallecillos-Ruiz; Max Hort; Leon Moonen

arXiv:2510.21513·cs.SE·October 31, 2025

TL;DR

This paper empirically evaluates the complementarity of different LLMs for code generation and repair, demonstrating that diversity-based ensemble strategies can significantly outperform single models and consensus methods.

Contribution

It provides the first comprehensive comparison of multiple LLMs and ensemble strategies for software engineering tasks, highlighting the effectiveness of diversity-based selection.

Findings

01

The theoretical upper bound on ensemble performance can be up to 83% above the best single model.

02

Consensus-based selection strategies often fall into a "popularity trap," favoring common but incorrect outputs.

03

Diversity-based strategies achieve up to 95% of the theoretical ensemble performance upper bound.
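
To make the contrast between these findings concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the candidate pool, the "behaviour signature" used to group candidates, and the functions `oracle_solved`, `consensus_pick`, and `diversity_pick` are all hypothetical names chosen for illustration. It assumes each strategy may submit k candidates and that a problem counts as solved if any submitted candidate passes.

```python
from collections import Counter

# Hypothetical candidate pool for one problem:
# (model name, behaviour signature, passes the hidden tests).
# The signature stands in for observable behaviour, e.g. outputs on shared inputs.
pool = [
    ("model_a", "returns_0", False),
    ("model_b", "returns_0", False),
    ("model_c", "returns_0", False),
    ("model_d", "returns_1", True),
    ("model_e", "raises", False),
]

def oracle_solved(pool):
    """Theoretical upper bound: solved if any candidate in the pool is correct."""
    return any(ok for _, _, ok in pool)

def consensus_pick(pool, k):
    """Pick k candidates, preferring the most common behaviour groups."""
    counts = Counter(sig for _, sig, _ in pool)
    # sorted() is stable, so candidates keep pool order within a group.
    ranked = sorted(pool, key=lambda cand: -counts[cand[1]])
    return ranked[:k]

def diversity_pick(pool, k):
    """Pick up to k candidates, one per distinct behaviour group, in pool order."""
    seen, picks = set(), []
    for cand in pool:
        if len(picks) >= k:
            break
        if cand[1] not in seen:
            seen.add(cand[1])
            picks.append(cand)
    return picks

def solved(picks):
    """A problem counts as solved if any submitted candidate passes."""
    return any(ok for _, _, ok in picks)

print("oracle:   ", oracle_solved(pool))              # True
print("consensus:", solved(consensus_pick(pool, 3)))  # False
print("diversity:", solved(diversity_pick(pool, 3)))  # True
```

In this toy pool, three models agree on the same wrong behaviour, so a popularity-driven pick spends its whole budget on that group, while spreading picks across distinct behaviours reaches the single correct candidate. Real selection heuristics would need a practical way to compare candidate behaviour, which this sketch simply assumes is given.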

Abstract

Today's pursuit of a single Large Language Model (LLM) for all software engineering tasks is resource-intensive and overlooks the potential benefits of complementarity, where different models contribute unique strengths. However, the degree to which coding LLMs complement each other and the best strategy for maximizing an ensemble's potential are unclear, leaving practitioners without a clear path to move beyond single-model systems. To address this gap, we empirically compare ten individual LLMs from five families, and three ensembles of these LLMs across three software engineering benchmarks covering code generation and program repair. We assess the complementarity between models and the performance gap between the best individual model and the ensembles. Next, we evaluate various selection heuristics to identify correct solutions from an ensemble's candidate pool. We find that the…
