Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation

Jiaju Chen; Yuxuan Lu; Xiaojie Wang; Huimin Zeng; Jing Huang; Jiri Gesi; Ying Xu; Bingsheng Yao; Dakuo Wang

arXiv:2507.21028·cs.CL·July 29, 2025

3 Reviews

TL;DR

This paper introduces MAJ-EVAL, a multi-agent framework that uses diverse LLM-based evaluators to simulate human multi-dimensional evaluation, improving alignment with human judgments in NLP assessments.

Contribution

The paper presents a novel framework for automatically creating diverse evaluator personas and multi-agent debates, enhancing the generalizability and accuracy of LLM-based evaluations.

Findings

01. MAJ-EVAL produces evaluations more aligned with human experts than traditional metrics.

02. The framework is effective in both educational and medical NLP domains.

03. Multi-agent debates improve the robustness of automated evaluations.

Abstract

Nearly all human work is collaborative; thus, the evaluation of real-world NLP applications often requires multiple dimensions that align with diverse human perspectives. As real human evaluator resources are often scarce and costly, the emerging "LLM-as-a-judge" paradigm sheds light on a promising approach to leverage LLM agents to believably simulate human evaluators. Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage in-group debates with multi-agents to Generate…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

The paper addresses a significant and practical problem: the need for scalable, multi-dimensional evaluation of NLP systems that aligns with diverse human stakeholders. The authors' motivation for moving beyond single-agent "LLM-as-a-judge" systems is well-articulated.

Weaknesses

*Missing Comparison to LLM-Generated Personas* This is the most significant issue. The paper's core claim is that its document-grounded personas are superior to "arbitrary" ones. However, the authors fail to test this against a strong, obvious baseline. The ablation study only compares their method to "simple role definition" (e.g., "You are a school teacher"), which is an insufficient comparison. A proper baseline would be to **prompt the LLM to generate a detailed, expert persona** directly f

Reviewer 02·Rating 4·Confidence 3

Strengths

1. **Stakeholder‑grounded automatic persona creation.** The two‑step procedure (dimension extraction → persona instantiation with rich attributes) is novel in the evaluation context and increases face validity of agents’ judgments relative to ad‑hoc, hand‑written personas.
2. **Consistent human alignment on multiple dimensions.** Across StorySparkQA and MSLR‑Cochrane, MAJ‑EVAL achieves stronger correlations with human ratings than ROUGE/BERTScore, single‑LLM judges, and a prior multi‑agent bas

Weaknesses

1. **Judge backbones are limited.** Experiments use **only two** judge LLMs (Qwen‑3‑235B and Claude‑3.7‑Sonnet). This leaves open whether the gains are robust across families/scales (e.g., Llama‑3.x, GPT‑4‑series, Mistral‑Large).
2. **Correlation differences lack uncertainty analysis.** Several comparisons rely on visual gaps in heatmaps/tables. Confidence intervals, statistical tests (e.g., Zou’s method for comparing dependent correlations), or bootstrap CIs would substantiate claims.
3. **
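
The uncertainty analysis requested in the second weakness could look roughly like the sketch below. It computes a paired percentile-bootstrap confidence interval for the difference between two Spearman correlations that share the same human ratings. This is an illustration only, not the paper's procedure and not Zou's method: the function names (`ranks`, `spearman`, `bootstrap_diff_ci`), the resampling settings, and the toy ratings are all assumptions made here.

```python
import random
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation undefined; treated as 0 in this sketch
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def bootstrap_diff_ci(human, judge_a, judge_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for rho(human, judge_a) - rho(human, judge_b).

    Items are resampled jointly, so the dependence between the two
    correlations (they share the human ratings) is preserved.
    """
    rng = random.Random(seed)
    n = len(human)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        a = [judge_a[i] for i in idx]
        b = [judge_b[i] for i in idx]
        diffs.append(spearman(h, a) - spearman(h, b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical toy ratings (made up for illustration, not from the paper).
human = [1, 2, 2, 3, 4, 4, 5, 5, 3, 1]
judge_a = [1, 2, 3, 3, 4, 5, 5, 4, 3, 2]
judge_b = [3, 1, 4, 2, 5, 2, 3, 4, 1, 5]
low, high = bootstrap_diff_ci(human, judge_a, judge_b)
print(f"95% CI for the difference in rho: [{low:.3f}, {high:.3f}]")
```

Under these assumptions, an interval that excludes zero would support the claim that one judge tracks the human ratings more closely than the other; an interval that straddles zero would suggest the visual gap is not reliable at that sample size.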

Reviewer 03·Rating 4·Confidence 5

Strengths

The paper makes a timely contribution to LLM-based evaluation. I like that the authors automated the agent generation step, which will improve replicability and objectivity. The results show improved reliability, making the approach conceptually novel and practically relevant for human-aligned evaluation design.

Weaknesses

Comment 1. The authors might want to check whether their persona extraction is valid. Several additional experiments could mitigate this concern: (1) The authors can randomly perturb the input corpus and track downstream ρ/τ changes against human ratings. (2) Although costly, the authors might consider recruiting several domain experts and asking them to verify the extracted dimensions. Additionally, domain experts can rate persona faithfulness and coverage. The authors would then be able to re
