TL;DR
This paper introduces a structured, uncertainty-aware evaluation framework for large language models using fuzzy AHP and a hybrid DualJudge system, improving reliability and consistency over traditional scoring methods.
Contribution
It adapts the Analytic Hierarchy Process to LLM evaluation, incorporating confidence scores and uncertainty modeling, and proposes DualJudge for enhanced assessment accuracy.
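For readers unfamiliar with AHP, the crisp (non-fuzzy) step can be sketched as follows. This is a minimal illustration of the standard geometric-mean AHP procedure with Saaty's consistency ratio, not the paper's implementation. The three criteria and the comparison values are hypothetical; in the paper's setting an LLM judge would presumably supply the pairwise judgments.

```python
import math

# Saaty's random consistency index for small matrix sizes.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix):
    """Priority weights from a reciprocal pairwise matrix (geometric-mean method)."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Consistency ratio CR = CI / RI; values below 0.1 are conventionally acceptable."""
    n = len(matrix)
    # Estimate the principal eigenvalue as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


# Hypothetical criteria: correctness, reasoning quality, clarity.
# comparisons[i][j] states how much more important criterion i is than j (1-9 scale).
comparisons = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]

weights = ahp_weights(comparisons)
cr = consistency_ratio(comparisons, weights)
print("weights:", [round(w, 3) for w in weights])
print("consistency ratio:", round(cr, 4))
```

A candidate response's overall score would then be the weighted sum of its per-criterion scores, which is what makes the judgment decomposable and inspectable compared with a single direct score.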
Findings
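Both crisp and fuzzy AHP outperform direct scoring on JudgeBench across model scales and dataset splits.
Confidence-aware uncertainty modeling produces better-calibrated judgments, with FAHP the most stable when comparisons are uncertain.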
DualJudge achieves state-of-the-art evaluation performance.
Abstract
Effective evaluation of large language models (LLMs) remains a critical bottleneck, as conventional direct scoring often yields inconsistent and opaque judgments. In this work, we adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation and, more importantly, propose a confidence-aware Fuzzy AHP (FAHP) extension that models epistemic uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores. Systematically validated on JudgeBench, our structured approach decomposes assessments into explicit criteria and incorporates uncertainty-aware aggregation, producing more calibrated judgments. Extensive experiments demonstrate that both crisp and fuzzy AHP consistently outperform direct scoring across model scales and dataset splits, with FAHP showing superior stability in uncertain comparison scenarios. Building on these insights, we propose…
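The confidence-aware fuzzy step described in the abstract can be sketched as follows. The abstract does not specify how confidence modulates the triangular fuzzy numbers, so everything here is an assumption: the linear mapping in `to_tfn`, the `max_spread` parameter, the use of Buckley's geometric-mean method for fuzzy weights, and centroid defuzzification are illustrative choices, and the judgments are hypothetical.

```python
import math


def to_tfn(judgment, confidence, max_spread=2.0):
    """Map a crisp judgment and a confidence in [0, 1] to a triangular fuzzy number.

    Assumed mapping: lower confidence widens the triangle linearly,
    clamped to the usual AHP scale [1/9, 9].
    """
    spread = (1.0 - confidence) * max_spread
    lower = max(judgment - spread, 1.0 / 9.0)
    upper = min(judgment + spread, 9.0)
    return (lower, judgment, upper)


def reciprocal(tfn):
    """Reciprocal of a triangular fuzzy number (l, m, u)."""
    l, m, u = tfn
    return (1.0 / u, 1.0 / m, 1.0 / l)


def build_matrix(n, judgments):
    """Fill a fuzzy reciprocal matrix from upper-triangle (value, confidence) pairs."""
    one = (1.0, 1.0, 1.0)
    matrix = [[one] * n for _ in range(n)]
    for (i, j), (value, conf) in judgments.items():
        matrix[i][j] = to_tfn(value, conf)
        matrix[j][i] = reciprocal(matrix[i][j])
    return matrix


def fuzzy_weights(matrix):
    """Buckley-style fuzzy geometric means, then centroid defuzzification."""
    n = len(matrix)
    geo = [
        tuple(math.prod(cell[k] for cell in row) ** (1.0 / n) for k in range(3))
        for row in matrix
    ]
    sum_l, sum_m, sum_u = (sum(g[k] for g in geo) for k in range(3))
    # Fuzzy division: lower bound over the largest total, upper bound over the smallest.
    fuzzy = [(g[0] / sum_u, g[1] / sum_m, g[2] / sum_l) for g in geo]
    crisp = [(l + m + u) / 3.0 for l, m, u in fuzzy]
    total = sum(crisp)
    return [c / total for c in crisp]


# Hypothetical judgments over three criteria: (i, j) -> (importance ratio, confidence).
judgments = {(0, 1): (3.0, 0.9), (0, 2): (5.0, 0.5), (1, 2): (2.0, 0.7)}
fahp = fuzzy_weights(build_matrix(3, judgments))

# With full confidence every triangle collapses to a point and the crisp result is recovered.
certain = {key: (value, 1.0) for key, (value, _) in judgments.items()}
crisp_limit = fuzzy_weights(build_matrix(3, certain))

print("fuzzy weights:", [round(w, 3) for w in fahp])
print("full-confidence weights:", [round(w, 3) for w in crisp_limit])
```

Under these assumptions, a low-confidence comparison contributes a wider triangle and so pulls less sharply on the final weights, which is one plausible reading of why a fuzzy variant would be more stable on uncertain comparisons.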
