Enhancing Answer Reliability Through Inter-Model Consensus of Large Language Models
Alireza Amiri-Margavi, Iman Jebellat, Ehsan Jebellat, Seyed Pouyan Mousavi Davoudi

TL;DR
This paper introduces a collaborative framework where multiple large language models generate and answer complex questions, demonstrating that inter-model consensus improves response reliability and question quality.
Contribution
The study presents a novel multi-model collaboration approach that enhances answer reliability and assesses question quality using statistical agreement measures.
Findings
Claude and GPT-4 produce high-quality, less ambiguous questions.
Inter-model consensus correlates with increased response reliability.
Gemini and LLaMA show greater variability and lower reliability.
Abstract
We propose a collaborative framework in which multiple large language models -- including GPT-4-0125-preview, Meta-LLaMA-3-70B-Instruct, Claude-3-Opus, and Gemini-1.5-Flash -- generate and answer complex, PhD-level statistical questions when definitive ground truth is unavailable. Our study examines how inter-model consensus both improves response reliability and indicates the quality of the generated questions. Employing chi-square tests, Fleiss' Kappa, and confidence interval analysis, we quantify consensus rates and inter-rater agreement to assess both response precision and question quality. Key results indicate that Claude and GPT-4 produce well-structured, less ambiguous questions with higher inter-rater agreement, as shown by narrower confidence intervals and greater alignment with question-generating models. In contrast, Gemini and LLaMA exhibit greater variability and lower…
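To illustrate the kind of agreement measure the abstract names, here is a minimal sketch of Fleiss' Kappa computed over multiple-choice answers from several models. It is not the authors' code: the function name `fleiss_kappa`, the answer labels, and the toy data are assumptions made for this example, and it assumes every question is answered by the same number of models.

```python
import math
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' Kappa for a list of items, each rated by the same number of raters.

    ratings    -- list of lists; ratings[i] holds one label per rater for item i
    categories -- list of all possible labels (hypothetical answer options here)
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who assigned category j to item i
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement: share of agreeing rater pairs, averaged over items
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    if math.isclose(p_expected, 1.0):
        # Only one category was ever used, so kappa is undefined (0/0)
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (invented for illustration): four models answer three questions
answers = [
    ["A", "A", "A", "B"],
    ["C", "C", "C", "C"],
    ["B", "D", "B", "A"],
]
print(fleiss_kappa(answers, ["A", "B", "C", "D"]))
```

Values near 1 indicate agreement well above chance, values near 0 indicate agreement close to chance, and negative values indicate less agreement than chance would produce.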
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Dense Connections · Label Smoothing · Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Adam · Residual Connection · Softmax · Attention Is All You Need
