Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework

Kaishuai Xu; Tiezheng Yu; Wenjun Hou; Yi Cheng; Liangyou Li; Xin Jiang; Lifeng Shang; Qun Liu; Wenjie Li

arXiv:2502.18874 · cs.CL · May 28, 2025

TL;DR

This paper introduces ARJudge, a unified framework that adaptively combines text-based and code-driven analyses to evaluate LLM responses more robustly and effectively than previous methods.

Contribution

The paper presents ARJudge, a novel evaluation framework that pairs a fine-tuned Analyzer with a tuning-free Refiner to improve robustness and adaptability in LLM response evaluation.

Findings

01 ARJudge outperforms existing evaluators in effectiveness.

02 ARJudge demonstrates enhanced robustness across diverse tasks.

03 Multi-faceted and code-driven analyses improve evaluation quality.

Abstract

Large Language Models (LLMs) are increasingly used for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analyses under predefined general criteria, which reduces their adaptability to unseen instructions and makes them unstable when evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and…
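
To make the idea of a code-driven analysis concrete, the following is a minimal sketch of how quantitative and structural constraints (a word limit, an exact bullet count) could be checked programmatically rather than by text-based judgment alone. It is an illustration only, not the authors' implementation: the names CheckResult, check_word_limit, check_bullet_count, and summarize are hypothetical, and the rule-based summarize function is a simple stand-in, not the paper's Refiner.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    """Outcome of one programmatic constraint check (hypothetical type)."""
    name: str
    passed: bool
    detail: str


def check_word_limit(response: str, max_words: int) -> CheckResult:
    # Count whitespace-separated tokens; list markers such as "-" are counted too.
    n = len(response.split())
    return CheckResult("word_limit", n <= max_words, f"{n} words (limit {max_words})")


def check_bullet_count(response: str, expected: int) -> CheckResult:
    # A line counts as a bullet if it starts with "-" or "*" followed by whitespace.
    bullets = [ln for ln in response.splitlines() if re.match(r"\s*[-*]\s+", ln)]
    return CheckResult(
        "bullet_count",
        len(bullets) == expected,
        f"{len(bullets)} bullets (expected {expected})",
    )


def summarize(checks: List[CheckResult]) -> dict:
    # Assumption: a plain rule-based roll-up, used here only to show how
    # individual check results might be gathered into one report.
    return {
        "all_passed": all(c.passed for c in checks),
        "report": [f"{c.name}: {'ok' if c.passed else 'FAIL'} ({c.detail})" for c in checks],
    }


if __name__ == "__main__":
    response = "- alpha beta\n- gamma delta\n- epsilon"
    result = summarize([
        check_word_limit(response, max_words=10),
        check_bullet_count(response, expected=3),
    ])
    for line in result["report"]:
        print(line)
```

Checks of this kind give a deterministic answer for constraints that are easy to count, which is where the abstract says purely text-based evaluators tend to be unstable.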

Taxonomy

Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Residual Connection · Linear Layer · Absolute Position Encodings · Layer Normalization · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer