BenchBrowser: Retrieving Evidence for Evaluating Benchmark Validity

Harshita Diddee; Gregory Yauney; Swabha Swayamdipta; Daphne Ippolito

arXiv:2603.18019 · cs.CL · April 10, 2026

TL;DR

BenchBrowser is a retrieval tool that surfaces relevant evaluation items across multiple benchmarks, helping practitioners verify whether those benchmarks measure the language model capabilities they are meant to measure.

Contribution

The paper introduces a retrieval system that gathers evidence for assessing benchmark validity, addressing the gap between what benchmarks actually contain and what practitioners intend to measure.

Findings

01

High retrieval precision confirmed by human study

02

Helps diagnose low content and convergent validity of benchmarks

03

Quantifies the gap between benchmark tests and practitioner intent

Abstract

Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a "poetry" benchmark may never test for haikus, while "instruction-following" benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases over 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability's facets) and low convergent validity (lack of stable rankings when measuring the same capability).…
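The abstract describes low convergent validity as a lack of stable model rankings when different benchmarks measure the same capability. One common way to quantify ranking stability is a rank correlation such as Kendall's tau. The sketch below is illustrative only and is not taken from the paper: the function name `kendall_tau`, the model names, and the scores are all hypothetical, and the paper may use a different statistic.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {model: score} dicts.

    Returns a value in [-1, 1]: 1 means the two sources rank every
    pair of models the same way, -1 means every pair is reversed.
    Tied pairs count as neither concordant nor discordant.
    """
    models = sorted(set(scores_a) & set(scores_b))
    n_pairs = len(models) * (len(models) - 1) // 2
    if n_pairs == 0:
        raise ValueError("need at least two models scored in both sources")

    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # The sign of the product says whether both sources agree on the order.
        agreement = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies on two item sets that claim to test the same capability.
bench_a = {"model_x": 0.81, "model_y": 0.74, "model_z": 0.62}
bench_b = {"model_x": 0.55, "model_y": 0.70, "model_z": 0.48}

tau = kendall_tau(bench_a, bench_b)
print(f"Kendall's tau between the two rankings: {tau:.2f}")
```

In this made-up example the two item sets disagree on one of the three model pairs, so tau falls well below 1. A low value like this is the kind of signal that would suggest the two sources do not rank models consistently for the capability in question.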
