Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability
Ninad Naik

TL;DR
This paper presents an ensemble validation framework for large language models that raises precision on cases requiring factual accuracy and causal consistency from 73.1% to 95.6% with three models, improving reliability for high-stakes autonomous AI applications.
Contribution
The paper introduces a novel ensemble-based content validation framework that leverages model consensus to improve LLM reliability without external knowledge or human oversight.
Findings
Precision increased from 73.1% to 93.9% with two models.
Precision reached 95.6% with three models.
Strong inter-model agreement (κ > 0.76) observed.
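To make the agreement figure above concrete, here is a minimal sketch of how a kappa statistic could be computed for two validators. It assumes κ refers to Cohen's kappa, κ = (p_o - p_e) / (1 - p_e), which this listing does not state explicitly; the function name `cohen_kappa` and the list-of-labels input format are hypothetical choices made for illustration.

```python
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    Assumption: each validator emits one categorical verdict per item
    (for example 1 = accept, 0 = reject).
    """
    a, b = list(a), list(b)
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("inputs must be non-empty and of equal length")

    # Observed agreement: share of items where both raters give the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    labels = set(a) | set(b)
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)

    if p_e == 1.0:
        # Both raters used a single identical label throughout; kappa is
        # undefined (0/0), and returning 1.0 here is a convention, not a rule.
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

A value near 1 means the validators almost always agree beyond chance, while a value near 0 means their agreement is about what chance would produce.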
Abstract
Large Language Models (LLMs) have shown significant advances in text generation but often lack the reliability needed for autonomous deployment in high-stakes domains like healthcare, law, and finance. Existing approaches rely on external knowledge or human oversight, limiting scalability. We introduce a novel framework that repurposes ensemble methods for content validation through model consensus. In tests across 78 complex cases requiring factual accuracy and causal consistency, our framework improved precision from 73.1% to 93.9% with two models (95% CI: 83.5%-97.9%) and to 95.6% with three models (95% CI: 85.2%-98.8%). Statistical analysis indicates strong inter-model agreement (κ > 0.76) while preserving sufficient independence to catch errors through disagreement. We outline a clear pathway to further enhance precision with additional validators and refinements. Although…
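The consensus idea described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes content is accepted only when every validator approves it (a unanimous rule, which the abstract does not spell out), and the names `Validator`, `consensus_accept`, and `precision` are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical interface: a validator wraps one model and returns True
# when that model judges the content acceptable.
Validator = Callable[[str], bool]


def consensus_accept(content: str, validators: Sequence[Validator]) -> bool:
    """Accept content only if all validators approve (assumed unanimous rule).

    Any single rejection blocks the content, which is how disagreement
    between models can catch an error that one model alone would miss.
    """
    if not validators:
        raise ValueError("at least one validator is required")
    return all(validate(content) for validate in validators)


def precision(accepted: Sequence[bool], correct: Sequence[bool]) -> float:
    """Fraction of accepted items that are actually correct."""
    if len(accepted) != len(correct):
        raise ValueError("inputs must have equal length")
    outcomes = [c for a, c in zip(accepted, correct) if a]
    if not outcomes:
        raise ValueError("no items were accepted")
    return sum(outcomes) / len(outcomes)
```

Under a unanimous rule, adding a validator can only shrink the accepted set. Precision rises when the extra validator rejects mostly incorrect items, and the price is that some correct content may be rejected as well.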
Taxonomy
Topics: Software Reliability and Analysis Research · Smart Grid Security and Resilience · Software System Performance and Reliability
