CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

Maosong Cao; Alexander Lam; Haodong Duan; Hongwei Liu; Songyang Zhang; Kai Chen

arXiv:2410.16256·cs.CL·October 22, 2024

TL;DR

CompassJudger-1 is an all-in-one open-source judge LLM designed to improve the accuracy, versatility, and reproducibility of model evaluations, supporting various assessment formats and tasks to advance LLM development.

Contribution

The paper introduces CompassJudger-1, described as the first versatile open-source judge model capable of multiple evaluation tasks, and establishes JudgerBench, a comprehensive benchmark for subjective evaluation tasks.

Findings

1. CompassJudger-1 demonstrates high versatility across evaluation tasks.
2. JudgerBench provides a unified platform for assessing judge models.
3. Open-sourcing accelerates research in LLM evaluation methodologies.

Abstract

Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge…
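The two reward-model modes named in the abstract, unitary scoring and two-model comparison, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the prompt wording, the `[[A]]`/`[[B]]` verdict tag, the `Score: N` convention, and the helper names are hypothetical and are not taken from the paper or its repository. The sketch covers only prompt construction and output parsing; the call to the judge model itself is left out.

```python
import re

# Hypothetical prompt template (assumption): the paper's real prompts
# are not shown in this section.
PAIRWISE_TEMPLATE = (
    "Compare the two responses to the question and pick the better one.\n"
    "Question: {question}\n"
    "Response A: {a}\n"
    "Response B: {b}\n"
    "End your answer with the verdict in the form [[A]] or [[B]]."
)


def build_pairwise_prompt(question, a, b):
    """Fill the hypothetical two-model comparison template."""
    return PAIRWISE_TEMPLATE.format(question=question, a=a, b=b)


def parse_verdict(judge_output):
    """Return 'A' or 'B' from the last [[A]]/[[B]] tag, or None if absent."""
    matches = re.findall(r"\[\[([AB])\]\]", judge_output)
    return matches[-1] if matches else None


def parse_score(judge_output, low=1, high=10):
    """Return the integer after 'Score:' if it lies in [low, high], else None.

    The 'Score: N' format and the 1-10 range are assumptions.
    """
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None
```

In use, the string returned by `build_pairwise_prompt` would be sent to whichever judge model is under test, and the model's reply passed to `parse_verdict`. Returning `None` for unparseable output keeps malformed judgments visible rather than silently counting them for one side.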

Code & Models

Repositories

open-compass/compassjudger (PyTorch, official)

Taxonomy

Topics: Simulation Techniques and Applications · Scientific Computing and Data Management · Evolutionary Algorithms and Applications

Methods: Softmax · Attention Is All You Need