Collective Reasoning Among LLMs: A Framework for Answer Validation Without Ground Truth

Seyed Pouyan Mousavi Davoudi; Amin Gholami Davodi; Alireza Amiri-Margavi; Alireza Shafiee Fard; Mahdi Jafari

arXiv:2502.20758 · stat.AP · August 12, 2025


TL;DR

This paper proposes a collaborative multi-model framework using diverse large language models to validate answers and assess question quality without relying on ground truth, leveraging inter-model agreement as a reliability indicator.

Contribution

It introduces an approach in which multiple LLMs collaboratively generate complex reasoning problems and validate one another's answers without ground truth, using statistical measures of inter-model agreement to assess answer reliability and question quality.

Findings

01. Claude and Gemini produce more coherent questions with higher agreement.

02. LLAMA shows greater variability and lower consistency in question formulation.

03. Multi-model agreement correlates with answer reliability and question clarity.

Abstract

We introduce a new approach in which several advanced large language models (specifically GPT-4-0125-preview, Meta-LLAMA-3-70B-Instruct, Claude-3-Opus, and Gemini-1.5-Flash) collaborate to both produce and answer intricate, doctoral-level probability problems without relying on any single "correct" reference. Rather than depending on an established ground truth, our investigation focuses on how agreement among diverse models can signal the reliability of their outputs and, by extension, reflect the overall quality of the generated questions. To measure this inter-model alignment, we apply a suite of statistical evaluations, including chi-square tests, Fleiss' Kappa coefficients, and confidence interval calculations, thereby capturing both precision in answers and clarity in question phrasing. Our analysis reveals that Claude and Gemini tend to frame questions more coherently and…
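As a rough illustration of the Fleiss' Kappa measure named in the abstract, the sketch below computes the coefficient for a set of questions answered by a fixed number of models. It is not the authors' code: the function name `fleiss_kappa`, the input layout, and the assumption that each answer is a categorical label (for example, a multiple-choice option) are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Hypothetical helper for illustration; assumes every item is rated by the
    same number of raters (here, models) and that labels are categorical.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item must have the same number of ratings")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Observed agreement: per-item fraction of rater pairs that agree, averaged.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum(
        (sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories
    )

    if math.isclose(p_expected, 1.0):
        # Only one category was ever used, so kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up answers from four models on three questions (illustrative data only).
example = [
    ["A", "A", "A", "A"],
    ["B", "B", "B", "B"],
    ["A", "A", "B", "B"],
]
kappa = fleiss_kappa(example)
```

Values near 1 indicate agreement well above chance, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than chance would produce. Under the paper's framing, a low value on a given question set could point to either unreliable answers or unclear questions.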


Taxonomy

Methods: LLaMA