Understanding LLM Evaluator Behavior: A Structured Multi-Evaluator Framework for Merchant Risk Assessment

Liang Wang; Junpeng Wang; Chin-chia Michael Yeh; Yan Zheng; Jiarui Sun; Xiran Fan; Xin Dai; Yujie Fan; Yiwei Cai

arXiv:2602.05110 · cs.AI · February 6, 2026

Open Access

TL;DR

This paper introduces a structured multi-evaluator framework to assess the reasoning quality and bias of Large Language Models in merchant risk assessment, providing insights into their reliability and alignment with human judgment.

Contribution

The paper presents a novel multi-evaluator framework with a consensus-deviation metric for evaluating LLM reasoning and bias in payment-risk settings, validated against human and real-world data.

Findings

01. LLMs show varying bias patterns, with some models exhibiting negative bias and others positive bias.

02. Anonymization reduces evaluator bias by approximately 25.8%.

03. Four models significantly align with payment-network data, confirming the framework's validity.

Abstract

Large Language Models (LLMs) are increasingly used as evaluators of reasoning quality, yet their reliability and bias in payments-risk settings remain poorly understood. We introduce a structured multi-evaluator framework for assessing LLM reasoning in Merchant Category Code (MCC)-based merchant risk assessment, combining a five-criterion rubric with Monte-Carlo scoring to evaluate rationale quality and evaluator stability. Five frontier LLMs generate and cross-evaluate MCC risk rationales under attributed and anonymized conditions. To establish a judge-independent reference, we introduce a consensus-deviation metric that eliminates circularity by comparing each judge's score to the mean of all other judges, yielding a theoretically grounded measure of self-evaluation and cross-model deviation. Results reveal substantial heterogeneity: GPT-5.1 and Claude 4.5 Sonnet show negative…
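
The abstract describes the consensus-deviation metric only in words: each judge's score is compared to the mean of all other judges. A minimal sketch of that leave-one-out idea follows, assuming a score array indexed as `[judge, target, run]` whose repeated runs stand in for the Monte-Carlo samples. The function names, array layout, and toy numbers are illustrative assumptions, not the paper's implementation or data.

```python
import numpy as np


def monte_carlo_summary(runs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Collapse repeated scoring runs into a mean and a spread.

    runs[j, t, k] is the score judge j gave target t on run k
    (hypothetical layout). Returns (mean, std), each of shape [judge, target].
    """
    return runs.mean(axis=-1), runs.std(axis=-1)


def consensus_deviation(scores: np.ndarray) -> np.ndarray:
    """Leave-one-out deviation of each judge from the other judges.

    scores[j, t] is judge j's (mean) score for target t. The result
    dev[j, t] = scores[j, t] - mean of scores[i, t] over all i != j,
    so a judge's own score never enters its own reference value.
    """
    n_judges = scores.shape[0]
    if n_judges < 2:
        raise ValueError("need at least two judges for a leave-one-out mean")
    column_totals = scores.sum(axis=0, keepdims=True)
    others_mean = (column_totals - scores) / (n_judges - 1)
    return scores - others_mean


# Toy data (made up): 3 judges score 3 rationales, 2 runs each.
runs = np.array([
    [[8.0, 8.0], [6.0, 6.0], [7.0, 7.0]],
    [[7.0, 7.0], [9.0, 9.0], [6.0, 6.0]],
    [[6.0, 6.0], [6.0, 6.0], [5.0, 5.0]],
])
mean_scores, score_std = monte_carlo_summary(runs)
dev = consensus_deviation(mean_scores)

# If judge j also generated rationale j, the diagonal is the
# self-evaluation deviation; off-diagonal entries are cross-model.
self_deviation = np.diag(dev)
print(self_deviation)
```

Under this convention a positive value means the judge scored above the consensus of its peers and a negative value means it scored below. The deviations for any one target sum to zero across judges, which makes a quick sanity check.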

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · FinTech, Crowdfunding, Digital Finance · Artificial Intelligence in Healthcare and Education