Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question
Rafael Rosales, Santiago Miret

TL;DR
This paper compares two diversity strategies for ensemble answering of binary questions with large language models: model diversity and question interpretation diversity. It finds that question interpretation diversity yields higher ensemble accuracy.
Contribution
It provides an empirical comparison of diversity approaches in LLM ensembles, highlighting the effectiveness of question interpretation diversity over model diversity.
Findings
Question interpretation diversity outperforms model diversity in ensemble accuracy.
Ensemble accuracy under model diversity often falls between that of the best and the worst individual model.
Ensemble methods using question framing improve answer accuracy across datasets.
Abstract
Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way of using diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions using LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on using the same model to answer the same question framed in different ways. For both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on BoolQ, StrategyQA, and PubMedQA show that question interpretation diversity consistently leads to better ensemble accuracy compared to model diversity. Furthermore, our analysis of GPT and LLaMa shows that model diversity typically…
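The two ensembling strategies described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Answerer` type, the function names, the stub models, and the tie-breaking rule are all hypothetical, and a real setup would replace the stubs with calls to actual LLMs and parse their yes/no outputs.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical type: any callable mapping a question string to a yes/no answer.
Answerer = Callable[[str], bool]


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return the majority answer. Ties resolve to False (an arbitrary assumption)."""
    counts = Counter(votes)
    return counts[True] > counts[False]


def model_diversity_answer(models: Sequence[Answerer], question: str) -> bool:
    """Model diversity: several models answer the same question, then vote."""
    return majority_vote([model(question) for model in models])


def interpretation_diversity_answer(model: Answerer, framings: Sequence[str]) -> bool:
    """Question interpretation diversity: one model answers several framings
    of the same question, then the answers are put to a vote."""
    return majority_vote([model(framing) for framing in framings])


# Stub "models" standing in for real LLM calls (illustration only).
def always_yes(question: str) -> bool:
    return True


def always_no(question: str) -> bool:
    return False


def keyword_model(question: str) -> bool:
    # Toy heuristic: answers "no" whenever the framing contains the word "not".
    return "not" not in question.split()


if __name__ == "__main__":
    question = "Is water wet?"
    framings = [
        "Is water wet?",
        "Is it true that water is wet?",
        "Is water not wet?",
    ]
    print(model_diversity_answer([always_yes, always_yes, always_no], question))
    print(interpretation_diversity_answer(keyword_model, framings))
```

Using an odd number of voters avoids ties altogether. The sketch also does not show how the alternative framings of a question are produced; the paper's own procedure for that should be consulted.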
