Language Models can Evaluate Themselves via Probability Discrepancy

Tingyu Xia; Bowen Yu; Yuan Wu; Yi Chang; Chang Zhou

arXiv:2405.10516·cs.CL·July 10, 2024

TL;DR

This paper introduces ProbDiff, a novel self-evaluation method for LLMs that measures their performance by analyzing probability discrepancies in their responses, eliminating the need for external evaluators.

Contribution

It presents a new self-assessment technique for LLMs that uses the models' own probability outputs to evaluate their capabilities without external models.

Findings

01. ProbDiff correlates well with GPT-4 based evaluations.

02. It performs effectively across multiple NLP tasks and benchmarks.

03. The method is applicable to LLMs of different sizes.

Abstract

In this paper, we initiate our discussion by demonstrating how Large Language Models (LLMs), when tasked with responding to queries, display a more even probability distribution in their answers if they are more adept, as opposed to their less skilled counterparts. Expanding on this foundational insight, we propose a new self-evaluation method ProbDiff for assessing the efficacy of various LLMs. This approach obviates the necessity for an additional evaluation model or the dependence on external, proprietary models like GPT-4 for judgment. It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions. A higher discrepancy for a given query between two LLMs indicates a relatively weaker capability. Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4,…
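The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the length-normalised log-probability, and the averaging over several revisions are all assumptions made for illustration. The per-token log-probabilities are assumed to come from the model under test scoring its own initial and revised responses.

```python
from statistics import mean


def avg_log_prob(token_logprobs):
    """Length-normalised log-probability of one response (an assumed choice)."""
    return mean(token_logprobs)


def prob_discrepancy(initial_logprobs, revised_logprobs_list):
    """Average gap between each revised response and the initial response.

    initial_logprobs: per-token log-probs the model assigns to its first answer.
    revised_logprobs_list: one list of per-token log-probs per revised answer.
    """
    base = avg_log_prob(initial_logprobs)
    gaps = [avg_log_prob(revised) - base for revised in revised_logprobs_list]
    return mean(gaps)


def more_capable(discrepancy_a, discrepancy_b):
    """Per the abstract, the model with the higher discrepancy is judged weaker."""
    if discrepancy_a < discrepancy_b:
        return "A"
    if discrepancy_b < discrepancy_a:
        return "B"
    return "tie"


# Hypothetical log-probs for one query, scored by each model on its own outputs.
disc_a = prob_discrepancy([-0.5, -0.5], [[-0.4, -0.4], [-0.2, -0.2]])
disc_b = prob_discrepancy([-1.0, -1.0], [[-0.5, -0.5]])
print(more_capable(disc_a, disc_b))
```

In practice the log-probabilities would be obtained by running each model over its own initial response and over the revision it produces when asked to improve that response. Aggregating the per-query verdicts over a benchmark would give a win rate between the two models.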

Code & Models

Repositories

xiatingyu/probdiff
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Attention Is All You Need · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Adam · Dropout