TL;DR
This paper introduces ARJudge, a unified framework that adaptively combines text and code analyses for more robust and effective evaluation of LLM responses, surpassing previous methods.
Contribution
The paper presents ARJudge, a novel evaluation framework with a fine-tuned analyzer and a tuning-free refiner, improving robustness and adaptability in LLM response evaluation.
Findings
ARJudge outperforms existing evaluators in effectiveness.
ARJudge demonstrates enhanced robustness across diverse tasks.
Multi-faceted and code-driven analyses improve evaluation quality.
Abstract
Large Language Models (LLMs) are increasingly used for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analyses under predefined general criteria, which reduces their adaptability to unseen instructions and makes them unstable when evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and…
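To make the idea of a code-driven analysis concrete, here is a minimal Python sketch of how quantitative and structural constraints (a word limit, a bullet count) could be checked programmatically instead of by reading the response. It is an illustration only, not ARJudge's actual implementation: the `Analysis` record, the function names `check_word_limit`, `check_bullet_count`, and `all_constraints_met`, and the example constraints are all assumptions introduced here.

```python
import re
from dataclasses import dataclass


@dataclass
class Analysis:
    """Hypothetical record for one code-driven check (not from the paper)."""
    criterion: str
    passed: bool
    detail: str


def check_word_limit(response: str, max_words: int) -> Analysis:
    # Count whitespace-separated tokens; list markers such as "-" count as tokens.
    n_words = len(response.split())
    return Analysis(f"at most {max_words} words", n_words <= max_words, f"{n_words} words")


def check_bullet_count(response: str, expected: int) -> Analysis:
    # A bullet is assumed to be a line starting with "-" or "*" followed by a space.
    bullets = [ln for ln in response.splitlines() if re.match(r"\s*[-*]\s+", ln)]
    return Analysis(
        f"exactly {expected} bullet points",
        len(bullets) == expected,
        f"{len(bullets)} bullets",
    )


def all_constraints_met(analyses: list[Analysis]) -> bool:
    # Simplistic aggregation: every code-driven check must pass.
    return all(a.passed for a in analyses)


if __name__ == "__main__":
    response = "- one\n- two\n- three"
    results = [check_word_limit(response, 10), check_bullet_count(response, 3)]
    for r in results:
        print(r.criterion, "->", "pass" if r.passed else "fail", f"({r.detail})")
    print("all constraints met:", all_constraints_met(results))
```

Checks like these are deterministic, which is the appeal for the kinds of constraints that the abstract says text-only judges evaluate unstably. How the real framework generates such code and merges it with text-based analyses is not specified in this summary.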
Taxonomy
Methods
Attention Is All You Need · Byte Pair Encoding · Dense Connections · Residual Connection · Linear Layer · Absolute Position Encodings · Layer Normalization · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer
