SedarEval: Automated Evaluation using Self-Adaptive Rubrics
Zhiyuan Fan, Weinong Wang, Xing Wu, Debing Zhang

TL;DR
SedarEval introduces a self-adaptive rubric-based evaluation paradigm for LLM outputs, creating detailed, question-specific scoring rubrics and training a specialized evaluator model that surpasses existing methods in accuracy and consistency.
Contribution
The paper presents a novel self-adaptive rubric framework and a comprehensive benchmark, SedarEval, with a trained evaluator LM that outperforms existing evaluation paradigms.
Findings
The trained evaluator LM achieves higher concordance with human grading than GPT-4.
SedarEval covers diverse domains including math, coding, and reasoning.
Self-adaptive rubrics improve evaluation precision and stability.
Abstract
The evaluation paradigm of LLM-as-judge has gained popularity because it significantly reduces human labor and time costs. This approach utilizes one or more large language models (LLMs) to assess the quality of outputs from other LLMs. However, existing methods rely on generic scoring rubrics that fail to consider the specificities of each question and its problem-solving process, compromising precision and stability in assessments. Inspired by human examination scoring processes, we propose a new evaluation paradigm based on self-adaptive rubrics. Specifically, we create detailed scoring rubrics for each question, capturing the primary and secondary criteria in a structured format of scoring and deduction points that mimic a human evaluator's analytical process. Building on this paradigm, we further develop a novel benchmark called SedarEval, which covers a range of domains including…
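As a rough illustration of what a question-specific rubric with scoring and deduction points could look like, here is a minimal Python sketch. It is not the paper's actual rubric schema or scoring rule: the class names `RubricItem` and `Rubric`, the field names, the point values, and the clamping of the final score to the range [0, max_score] are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class RubricItem:
    """One criterion; `points` is a positive magnitude (hypothetical field)."""
    description: str
    points: float


@dataclass
class Rubric:
    """Hypothetical per-question rubric: points to earn and points to lose."""
    question: str
    max_score: float
    scoring_points: list[RubricItem] = field(default_factory=list)
    deduction_points: list[RubricItem] = field(default_factory=list)

    def score(self, met: set[int], triggered: set[int]) -> float:
        # Sum the scoring points the answer satisfied (by index).
        earned = sum(
            item.points for i, item in enumerate(self.scoring_points) if i in met
        )
        # Sum the deduction points the answer triggered (by index).
        lost = sum(
            item.points
            for i, item in enumerate(self.deduction_points)
            if i in triggered
        )
        # Assumed rule: clamp the total to [0, max_score].
        return max(0.0, min(self.max_score, earned - lost))


# Illustrative rubric for a toy math question (all values are made up).
rubric = Rubric(
    question="Solve 2x + 3 = 11 for x.",
    max_score=5.0,
    scoring_points=[
        RubricItem("Isolates the variable term (2x = 8)", 2.0),
        RubricItem("States the correct final answer (x = 4)", 3.0),
    ],
    deduction_points=[
        RubricItem("Arithmetic slip in an intermediate step", 1.0),
    ],
)

full_with_slip = rubric.score(met={0, 1}, triggered={0})
partial = rubric.score(met={0}, triggered=set())
only_error = rubric.score(met=set(), triggered={0})
```

In this sketch a grader (human or LLM) only decides which criteria were met or triggered, and the arithmetic is deterministic. Whether the paper's evaluator separates judgment from arithmetic in this way is not stated in the text above.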
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
Methods: Softmax · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Attention Is All You Need · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing
