Algorithmically Establishing Trust in Evaluators
Adrian de Wynter

TL;DR
This paper introduces the No-Data Algorithm, a method to establish trust in evaluators like LLMs without needing labeled data, by challenging them repeatedly and analyzing their responses.
Contribution
It presents a provably correct algorithm that assesses an evaluator's trustworthiness without relying on reference data, making it suitable for labeling tasks in low-resource languages.
Findings
The algorithm reliably accepts trustworthy evaluators after r challenges.
It effectively flags untrustworthy evaluators.
Empirical tests confirm theoretical guarantees.
Abstract
An evaluator, such as an LLM-as-a-judge, is trustworthy when there exists some agreed-upon way to measure its performance as a labeller. Traditional approaches either rely on testing the evaluator against references or assume that it somehow 'knows' the correct labelling. Both approaches fail when references are unavailable: the former requires data, and the latter is an assumption, not evidence. To address this, we introduce the 'No-Data Algorithm', which provably establishes trust in an evaluator without requiring any labelled data. Our algorithm works by successively posing challenges to the evaluator. We prove that after r challenge rounds, it accepts an evaluator which knows the correct labels with probability , and reliably flags untrustworthy ones. We present formal proofs of correctness, empirical tests, and applications to assessing trust in LLMs-as-judges…
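To make the idea of repeated challenges more concrete, here is a minimal Python sketch of a generic challenge-and-check loop. It is not the No-Data Algorithm from the paper: the integer items, the parity labels, the `flip` transformation, and the names `challenge` and `run_challenges` are all hypothetical. In this sketch the verifier never sees a label. It only knows a transformation assumed to invert the label, and it checks whether the evaluator's answer changes accordingly under a secret coin. An evaluator that knows the labels always passes. One that ignores its input passes each round with probability 1/2, so it survives r rounds with probability 2^-r.

```python
import random
from typing import Callable

# Hypothetical setup (not from the paper): items are integers, the hidden
# ground-truth label is parity, and the verifier knows one label-flipping
# transformation, flip(x) = x + 1, without knowing any labels itself.
Evaluator = Callable[[int], int]


def flip(x: int) -> int:
    """Hypothetical transformation assumed to invert the binary label."""
    return x + 1


def challenge(evaluator: Evaluator, x: int, rng: random.Random) -> bool:
    """One round: ask for a label, then re-ask on x or flip(x) chosen by a secret coin."""
    first = evaluator(x)
    coin = rng.randint(0, 1)            # hidden from the evaluator
    probe = flip(x) if coin else x
    # A consistent evaluator must change its answer exactly when coin == 1.
    return evaluator(probe) == first ^ coin


def run_challenges(evaluator: Evaluator, r: int, rng: random.Random) -> bool:
    """Accept only if the evaluator survives all r rounds (stops at the first failure)."""
    return all(challenge(evaluator, rng.randrange(10**6), rng) for _ in range(r))


def knows_labels(x: int) -> int:
    return x % 2                        # returns the hidden ground truth


def always_zero(x: int) -> int:
    return 0                            # ignores the item entirely


if __name__ == "__main__":
    rng = random.Random(0)
    r, trials = 5, 20_000
    honest = sum(run_challenges(knows_labels, r, rng) for _ in range(trials)) / trials
    lazy = sum(run_challenges(always_zero, r, rng) for _ in range(trials)) / trials
    print(f"knows_labels accepted: {honest:.3f}")
    print(f"always_zero accepted:  {lazy:.4f} (expected about {0.5 ** r:.4f})")
```

A consistency check of this kind cannot distinguish true labels from systematically inverted ones. For example, an evaluator returning `1 - x % 2` would also pass every round. The sketch therefore only illustrates how repeating rounds drives down the acceptance rate of an evaluator that does not track the labels.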
Taxonomy
Topics: Cryptography and Data Security · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI
