TL;DR
This paper introduces MAJ-EVAL, a multi-agent framework that uses diverse LLM-based evaluators to simulate human multi-dimensional evaluation, improving alignment with human judgments in NLP assessments.
Contribution
The paper presents a novel framework for automatically creating diverse evaluator personas and multi-agent debates, enhancing the generalizability and accuracy of LLM-based evaluations.
Findings
MAJ-EVAL produces evaluations more aligned with human experts than traditional metrics.
The framework is effective in both educational and medical NLP domains.
Multi-agent debates improve the robustness of automated evaluations.
Abstract
Nearly all human work is collaborative; thus, the evaluation of real-world NLP applications often requires multiple dimensions that align with diverse human perspectives. As real human evaluator resources are often scarce and costly, the emerging "LLM-as-a-judge" paradigm sheds light on a promising approach to leverage LLM agents to believably simulate human evaluators. Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage in-group debates with multi-agents to Generate…
Peer Reviews
Decision·Submitted to ICLR 2026
The paper addresses a significant and practical problem: the need for scalable, multi-dimensional evaluation of NLP systems that aligns with diverse human stakeholders. The authors' motivation for moving beyond single-agent "LLM-as-a-judge" systems is well-articulated.
*Missing Comparison to LLM-Generated Personas* This is the most significant issue. The paper's core claim is that its document-grounded personas are superior to "arbitrary" ones. However, the authors fail to test this against a strong, obvious baseline. The ablation study only compares their method to "simple role definition" (e.g., "You are a school teacher"), which is an insufficient comparison. A proper baseline would be to **prompt the LLM to generate a detailed, expert persona** directly f
1. **Stakeholder‑grounded automatic persona creation.** The two‑step procedure (dimension extraction → persona instantiation with rich attributes) is novel in the evaluation context and increases face validity of agents’ judgments relative to ad‑hoc, hand‑written personas.
2. **Consistent human alignment on multiple dimensions.** Across StorySparkQA and MSLR‑Cochrane, MAJ‑EVAL achieves stronger correlations with human ratings than ROUGE/BERTScore, single‑LLM judges, and a prior multi‑agent bas
1. **Judge backbones are limited.** Experiments use **only two** judge LLMs (Qwen‑3‑235B and Claude‑3.7‑Sonnet). This leaves open whether the gains are robust across families/scales (e.g., Llama‑3.x, GPT‑4‑series, Mistral‑Large).
2. **Correlation differences lack uncertainty analysis.** Several comparisons rely on visual gaps in heatmaps/tables. Confidence intervals, statistical tests (e.g., Zou’s method for comparing dependent correlations), or bootstrap CIs would substantiate claims.
3. **
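The uncertainty analysis this review asks for can be illustrated with a paired bootstrap. The following is a minimal sketch, not the authors' procedure and not Zou's method: it resamples items with replacement and reports a percentile interval for the difference between two Spearman correlations that share the same human ratings. All names (`human`, `judge_a`, `judge_b`, `bootstrap_diff_ci`) and the toy data are hypothetical, and treating a constant resample as zero correlation is an assumption made for simplicity.

```python
import random


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def bootstrap_diff_ci(human, judge_a, judge_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for rho(human, judge_a) - rho(human, judge_b).

    Items are resampled jointly, so the dependence between the two
    correlations (they share the same human ratings) is preserved.
    """
    rng = random.Random(seed)
    n = len(human)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        a = [judge_a[i] for i in idx]
        b = [judge_b[i] for i in idx]
        diffs.append(spearman(h, a) - spearman(h, b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


if __name__ == "__main__":
    # Toy, made-up scores purely for illustration.
    human = [1, 2, 3, 4, 5, 6, 7, 8]
    judge_a = [1, 3, 2, 4, 5, 7, 6, 8]
    judge_b = [4, 1, 6, 2, 8, 3, 5, 7]
    print(bootstrap_diff_ci(human, judge_a, judge_b))
```

If the resulting interval excludes zero, the gap between the two judges is unlikely to be a resampling artifact. With small evaluation sets the interval will usually be wide, which is the concern the reviewer raises about reading gaps off heatmaps.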
The paper makes a timely contribution to LLM-based evaluation. I like that the authors automated the agent-generation step, which will improve replicability and objectivity. The results show improved reliability, making the approach conceptually novel and practically relevant for human-aligned evaluation design.
Comment 1. The authors might want to check whether their persona extraction is valid. Several additional experiments could mitigate this concern: (1) The authors can randomly perturb the input corpus and track downstream ρ/τ changes against human ratings. (2) Although costly, the authors might consider recruiting several domain experts and asking them to verify the extracted dimensions. Additionally, domain experts can rate persona faithfulness and coverage. The authors would then be able to re
