BenchBrowser: Retrieving Evidence for Evaluating Benchmark Validity
Harshita Diddee, Gregory Yauney, Swabha Swayamdipta, Daphne Ippolito

TL;DR
BenchBrowser is a retrieval tool that surfaces relevant evaluation items across multiple benchmarks, helping practitioners verify whether benchmarks measure the language model capabilities they are intended to measure.
Contribution
The paper introduces a retrieval system that provides evidence for evaluating benchmark validity, addressing the gap between what benchmarks actually contain and what practitioners want to measure.
Findings
High retrieval precision confirmed by human study
Helps diagnose low content and convergent validity of benchmarks
Quantifies the gap between benchmark tests and practitioner intent
Abstract
Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a "poetry" benchmark may never test for haikus, while "instruction-following" benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases over 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability's facets) and low convergent validity (lack of stable rankings when measuring the same capability).…
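To make "lack of stable rankings" concrete, the sketch below computes Kendall's tau between the model rankings induced by two sets of evaluation items that nominally measure the same capability. This is an illustration only, not the paper's procedure: the abstract does not say which rank statistic BenchBrowser uses, and the function name, model names, and scores are all hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {model: score} dicts.

    Illustrative helper (an assumption, not taken from the paper).
    A value near 1 means the two item sets rank models the same way;
    a value near 0 or below means the rankings disagree.
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")

    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # The sign of the product says whether both item sets
        # order this pair of models the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies on two item sets retrieved for the same use case.
bench_a = {"model_x": 0.81, "model_y": 0.74, "model_z": 0.62}
bench_b = {"model_x": 0.55, "model_y": 0.68, "model_z": 0.49}

tau = kendall_tau(bench_a, bench_b)
print(f"Kendall tau: {tau:.3f}")
```

Here the two item sets disagree on the order of one of the three model pairs, so tau is well below 1. Under this reading, low convergent validity would show up as low rank agreement between item sets that claim to test the same capability.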
