Wisdom and Delusion of LLM Ensembles for Code Generation and Repair
Fernando Vallecillos-Ruiz, Max Hort, Leon Moonen

TL;DR
This paper empirically evaluates the complementarity of different LLMs for code generation and repair, demonstrating that diversity-based ensemble strategies can significantly outperform single models and consensus methods.
Contribution
It provides the first comprehensive comparison of multiple LLMs and ensemble strategies for software engineering tasks, highlighting the effectiveness of diversity-based selection.
Findings
The theoretical upper bound on ensemble performance can be up to 83% higher than the best single model.
Consensus-based selection strategies often fall into a "popularity trap", favoring common but incorrect outputs.
Diversity-based strategies achieve up to 95% of the theoretical ensemble performance upper bound.
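To make the quantities in these findings concrete, here is a minimal Python sketch. It is not the authors' code: the model names, problem IDs, candidate strings, and helper functions are all hypothetical, and it assumes the theoretical upper bound counts a problem as solved when at least one model in the ensemble solves it. It also shows how a plain majority vote can select a common but wrong candidate.

```python
from collections import Counter

# Hypothetical results (invented for illustration): the set of problem IDs
# that each model solves on some benchmark.
solved = {
    "model_a": {1, 2, 3, 4},
    "model_b": {3, 4, 5},
    "model_c": {1, 6},
}


def best_single(solved_by_model):
    """Number of problems solved by the strongest individual model."""
    return max(len(ids) for ids in solved_by_model.values())


def ensemble_upper_bound(solved_by_model):
    """Problems solved by at least one model (assumed oracle selection)."""
    return len(set().union(*solved_by_model.values()))


def relative_gain(solved_by_model):
    """Upper bound expressed as a fractional gain over the best single model."""
    best = best_single(solved_by_model)
    return (ensemble_upper_bound(solved_by_model) - best) / best


def majority_vote(candidates):
    """Consensus selection: return the most frequent candidate output."""
    return Counter(candidates).most_common(1)[0][0]


# Two models agree on a wrong answer; one model produces the right one.
candidates = ["return a - b", "return a - b", "return a + b"]
correct = "return a + b"

print(best_single(solved))           # 4
print(ensemble_upper_bound(solved))  # 6
print(relative_gain(solved))         # 0.5
print(majority_vote(candidates) == correct)  # False
```

In this toy data the models overlap only partially, so the union of solved problems exceeds what any one model solves. The vote still returns the incorrect candidate because it is the most frequent one, which is the failure mode the findings call a popularity trap.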
Abstract
Today's pursuit of a single Large Language Model (LLM) for all software engineering tasks is resource-intensive and overlooks the potential benefits of complementarity, where different models contribute unique strengths. However, the degree to which coding LLMs complement each other and the best strategy for maximizing an ensemble's potential are unclear, leaving practitioners without a clear path to move beyond single-model systems. To address this gap, we empirically compare ten individual LLMs from five families, and three ensembles of these LLMs across three software engineering benchmarks covering code generation and program repair. We assess the complementarity between models and the performance gap between the best individual model and the ensembles. Next, we evaluate various selection heuristics to identify correct solutions from an ensemble's candidate pool. We find that the…
