Self-Judge: Selective Instruction Following with Alignment Self-Evaluation

Hai Ye; Hwee Tou Ng

arXiv:2409.00935·cs.CL·September 4, 2024

TL;DR

This paper introduces Self-J, a self-training framework for developing judge models that predict response quality in large language models, enabling selective instruction following and improving alignment without requiring human-annotated data.

Contribution

The paper presents a novel self-evaluation method for LLMs that leverages self-distillation and gold references to improve response quality assessment without human labels.

Findings

01. Self-J's predicted quality scores correlate better with GPT-4 than those of strong baselines.

02. Judge models improve reward modeling for instruction-following tasks.

03. The method generalizes well across domains.

Abstract

Pre-trained large language models (LLMs) can be tailored to adhere to human instructions through instruction tuning. However, due to shifts in the distribution of test-time data, they may not always execute instructions accurately, potentially generating factual errors or misaligned content when acting as chat assistants. To enhance the reliability of LLMs in following instructions, we propose the study of selective instruction following, whereby the system declines to execute instructions if the anticipated response quality is low. We train judge models that can predict numerical quality scores for model responses. To address data scarcity, we introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores. Our method leverages the model's inherent self-evaluation capability to extract information about response quality from…
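
The selective instruction following idea described in the abstract (decline to answer when the predicted response quality is low) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `selective_respond`, the `threshold` parameter, and the toy `toy_generate` and `toy_judge` stand-ins are all hypothetical names introduced here, and the score scale is an assumption.

```python
from typing import Callable, Optional


def selective_respond(
    instruction: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], float],
    threshold: float,
) -> Optional[str]:
    """Return the response only if the judge's predicted quality clears the threshold.

    `generate` stands in for an instruction-tuned LLM and `judge` for a judge
    model that maps (instruction, response) to a numerical quality score.
    Both are hypothetical placeholders, not part of the paper's code.
    """
    response = generate(instruction)
    score = judge(instruction, response)
    # Abstain (return None) when the anticipated quality is too low.
    return response if score >= threshold else None


# Toy stand-ins so the sketch runs without any model (assumptions, not real models).
def toy_generate(instruction: str) -> str:
    return f"Answer to: {instruction}"


def toy_judge(instruction: str, response: str) -> float:
    # Pretend the judge is confident only on questions mentioning "capital".
    return 8.0 if "capital" in instruction else 2.0


if __name__ == "__main__":
    for prompt in ["What is the capital of France?", "Prove P != NP"]:
        result = selective_respond(prompt, toy_generate, toy_judge, threshold=5.0)
        print(prompt, "->", result if result is not None else "[declined]")
```

In a sketch like this, the threshold controls the trade-off between how many instructions are answered and how reliable the answered ones are; how that threshold would be chosen in practice is not specified here.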

Code & Models

Repositories

nusnlp/Self-J
PyTorch · Official

Taxonomy

Topics: Human Resource Development and Performance Evaluation

Methods: Attention Is All You Need · Cosine Annealing · Absolute Position Encodings · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection · Linear Warmup With Cosine Annealing · Transformer