Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge

Yulong He; Ivan Smirnov; Dmitry Fedrushkov; Sergey Kovalchuk; Ilya Revin

arXiv:2604.03742·cs.AI·April 7, 2026

TL;DR

This paper introduces a structured, uncertainty-aware evaluation framework for large language models using fuzzy AHP and a hybrid DualJudge system, improving reliability and consistency over traditional scoring methods.

Contribution

The paper adapts the Analytic Hierarchy Process (AHP) to LLM-based evaluation, extends it with a fuzzy variant that uses LLM-generated confidence scores to model uncertainty, and proposes DualJudge, a hybrid judge intended to improve assessment accuracy.
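
As a rough illustration of what classical (crisp) AHP computes, the sketch below derives criterion weights from a reciprocal pairwise comparison matrix using the row geometric mean method and checks Saaty's consistency ratio. The criteria names and the comparison values are hypothetical, and this is a textbook AHP sketch rather than the paper's implementation.

```python
import math

# Saaty's random consistency index for small matrices (standard textbook values).
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}

def ahp_weights(matrix):
    """Priority weights via the row geometric mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]

def consistency_ratio(matrix, weights):
    """Saaty's consistency ratio; values below 0.1 are usually considered acceptable."""
    n = len(matrix)
    if n < 3:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    # Estimate the principal eigenvalue as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]

# Hypothetical criteria, for illustration only.
criteria = ["correctness", "reasoning", "clarity"]
# comparisons[i][j] = how strongly criterion i is preferred over criterion j.
comparisons = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]

weights = ahp_weights(comparisons)
cr = consistency_ratio(comparisons, weights)
for name, w in zip(criteria, weights):
    print(f"{name}: {w:.3f}")
print(f"consistency ratio: {cr:.3f}")
```

The consistency ratio is one reason a structured comparison can be more transparent than a single direct score: contradictory pairwise judgments show up as a large ratio instead of being hidden inside one number.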

Findings

01 Fuzzy AHP outperforms direct scoring in model evaluation.

02 Uncertainty modeling improves judgment calibration.

03 DualJudge achieves state-of-the-art evaluation performance.

Abstract

Effective evaluation of large language models (LLMs) remains a critical bottleneck, as conventional direct scoring often yields inconsistent and opaque judgments. In this work, we adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation and, more importantly, propose a confidence-aware Fuzzy AHP (FAHP) extension that models epistemic uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores. Systematically validated on JudgeBench, our structured approach decomposes assessments into explicit criteria and incorporates uncertainty-aware aggregation, producing more calibrated judgments. Extensive experiments demonstrate that both crisp and fuzzy AHP consistently outperform direct scoring across model scales and dataset splits, with FAHP showing superior stability in uncertain comparison scenarios. Building on these insights, we propose…
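
To make the idea of confidence-modulated triangular fuzzy numbers concrete, here is a minimal sketch under stated assumptions. The multiplicative spread rule in `to_tfn`, the `max_spread` parameter, and the use of Buckley's fuzzy geometric mean with centroid defuzzification are illustrative choices, not details confirmed by the abstract; the judgments are made up.

```python
import math

def to_tfn(ratio, confidence, max_spread=1.0):
    """Map a preference ratio and a confidence in [0, 1] to a triangular fuzzy
    number (low, mid, high). Assumed rule: lower confidence widens the spread."""
    d = max_spread * (1.0 - confidence)
    return (ratio / (1.0 + d), ratio, ratio * (1.0 + d))

def reciprocal(tfn):
    """Reciprocal of a triangular fuzzy number."""
    low, mid, high = tfn
    return (1.0 / high, 1.0 / mid, 1.0 / low)

def build_fuzzy_matrix(n, judgments):
    """judgments maps (i, j) with i < j to (ratio, confidence)."""
    one = (1.0, 1.0, 1.0)
    matrix = [[one] * n for _ in range(n)]
    for (i, j), (ratio, confidence) in judgments.items():
        matrix[i][j] = to_tfn(ratio, confidence)
        matrix[j][i] = reciprocal(matrix[i][j])
    return matrix

def fuzzy_weights(matrix):
    """Buckley-style fuzzy geometric mean, then centroid defuzzification."""
    n = len(matrix)
    geo = [
        tuple(math.prod(cell[k] for cell in row) ** (1.0 / n) for k in range(3))
        for row in matrix
    ]
    sum_low, sum_mid, sum_high = (sum(g[k] for g in geo) for k in range(3))
    # Fuzzy division: low bound divides by the largest sum, high by the smallest.
    fuzzy = [(g[0] / sum_high, g[1] / sum_mid, g[2] / sum_low) for g in geo]
    crisp = [sum(w) / 3.0 for w in fuzzy]  # centroid of each triangle
    total = sum(crisp)
    return [c / total for c in crisp]

# Hypothetical judgments over three criteria: (ratio, LLM-reported confidence).
judgments = {(0, 1): (3.0, 0.9), (0, 2): (5.0, 0.4), (1, 2): (2.0, 0.7)}
weights = fuzzy_weights(build_fuzzy_matrix(3, judgments))
print([round(w, 3) for w in weights])
```

With every confidence set to 1.0 the triangles collapse to single points and the result matches crisp AHP with row geometric means, so in this sketch the fuzzy variant only changes the outcome where the judge reports uncertainty.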

Code & Models

Repositories

hreyulog/AHP_llm_judge (GitHub)
