Self-Taught Evaluators
Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li

TL;DR
This paper introduces a self-improving evaluation method for language models that uses synthetic data and iterative training, eliminating the need for costly human preference annotations.
Contribution
It presents a self-taught evaluation approach that trains an LLM-based judge solely on synthetic data, outperforming GPT-4 as a judge and matching top reward models trained on human-annotated data.
Findings
Improves Llama3-70B-Instruct from 75.4 to 88.3 on RewardBench.
Outperforms GPT-4 as an evaluator.
Matches top reward models trained with human labels.
Abstract
Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on…
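The iterative scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Judge`, `TrainingExample`, `collect_examples`, `self_teach`, and `judge_with_vote` are hypothetical, the judge and the fine-tuning step are passed in as plain callables, and the filtering rule (keep a sampled judgment only when it prefers the response constructed to be better) is an assumption about how "improved predictions" become training data. For simplicity the better response is always placed in position A; a real pipeline would likely randomize the order.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical judge interface: (instruction, response_a, response_b)
# -> (reasoning_trace, verdict), where verdict is "A" or "B".
Judge = Callable[[str, str, str], Tuple[str, str]]


@dataclass
class TrainingExample:
    instruction: str
    response_a: str
    response_b: str
    reasoning: str
    verdict: str  # "A" or "B"


def majority_vote(verdicts: List[str]) -> Optional[str]:
    """Return the most common verdict, or None on a tie or empty input."""
    counts = Counter(verdicts).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def judge_with_vote(judge: Judge, instruction: str, a: str, b: str,
                    n_samples: int = 5) -> Optional[str]:
    """Sample several judgments and aggregate them by majority vote."""
    verdicts = [judge(instruction, a, b)[1] for _ in range(n_samples)]
    return majority_vote(verdicts)


def collect_examples(judge: Judge,
                     pairs: List[Tuple[str, str, str]],
                     n_samples: int = 3) -> List[TrainingExample]:
    """Build training data from synthetic (instruction, better, worse) pairs.

    Assumption: a sampled judgment is kept only if it prefers the response
    that was constructed to be better (always shown in position A here).
    Pairs with no agreeing judgment after n_samples tries are dropped.
    """
    examples = []
    for instruction, better, worse in pairs:
        for _ in range(n_samples):
            reasoning, verdict = judge(instruction, better, worse)
            if verdict == "A":
                examples.append(
                    TrainingExample(instruction, better, worse, reasoning, verdict)
                )
                break
    return examples


def self_teach(judge: Judge,
               pairs: List[Tuple[str, str, str]],
               finetune: Callable[[Judge, List[TrainingExample]], Judge],
               n_iterations: int = 2) -> Judge:
    """Repeat: label pairs with the current judge, then train the next judge."""
    for _ in range(n_iterations):
        examples = collect_examples(judge, pairs)
        judge = finetune(judge, examples)
    return judge
```

In this sketch, `finetune` stands in for whatever supervised training step produces the next judge from the retained reasoning traces and verdicts, and `judge_with_vote` corresponds to the majority-vote variant of inference mentioned in the abstract.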
Peer Reviews
Decision · Submitted to ICLR 2025
- The self-taught framework presented in this paper is quite novel. It uses an iterative self-improvement process that enables the model to independently refine its judgment skills, offering valuable insights for future research in self-supervised evaluation methods for LLMs. - The paper is well-structured and clearly written. - The proposed self-taught method could effectively reduce the dependency on human labeling by enabling the model to generate its own synthetic data and preference la
- The method has only been tested on one specific LLM variant (LLAMA3-70B-Instruct), making it unclear whether the approach generalizes well to other types of LLMs or models of different sizes and architectures. - The process for generating contrasting synthetic preference pairs is relatively simple and lacks refinement. The prompt used to generate suboptimal responses often results in fairly static patterns, and it remains unverified whether the generated response $y^l$ is indeed worse than th
- This paper presents an efficient method for an important problem of LLM evaluation. The method relies on synthetic data without the use of human data and thus it is easier to scale in practice. - The presentation of the paper is very clear and it is easy to follow and find relevant information. - The proposed method is quite original, especially the component on generation of artificial pairs of responses where one element is constructed to be better than the other element. - The experimental s
- My main concern is about using *two* language models in the experiments. While the method is presented as a "self-taught" evaluator, the actual experiments rely on the use of two models: Mixtral 22Bx8 Instruct for generating the initial synthetic responses, initial judgments and categorizing the queries, and Llama3-70B-Instruct for everything else. How is the need for the second model motivated? In this case, wouldn't it be the situation of distilling the knowledge of two LLMs into one rather t
1. **Novel Contribution**: The paper addresses a significant challenge in model evaluation by eliminating the need for costly human annotations, which can be both expensive and quickly outdated. 2. **Technical Implementation**: The proposed solution is an end-to-end pipeline that includes data curation, iterative synthetic data generation, and model training. The training procedure is also properly outlined. 3. **Performance**: The method shows decent performance, competing with human-annot
## Methodological Concerns **Reliability of LLM-as-a-Judge**: The paper does not thoroughly validate the reliability of various components in the LLM-as-a-Judge system used throughout the pipeline, raising concerns about the accuracy and consistency of judgments. > Line 230: To perform prompt selection, we annotate the category of each instruction with the Mixtral 22Bx8 Instruct model, using the template in Figure 7 and select 20,582 examples in the reasoning category, as we expect these to
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification
Methods: Linear Layer · Layer Normalization · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections
