Query-efficient model evaluation using cached responses

Hayden Helm; Ben Johnson; Carey Priebe

arXiv:2605.07096·cs.LG·May 11, 2026

TL;DR

This paper predicts a new model's benchmark performance from a reduced set of queries by reusing cached responses from previously evaluated models. The approach is built on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting.

Contribution

The paper introduces DKPS-based techniques for model evaluation that are shown theoretically to be query-efficient under certain conditions and that empirically match baseline accuracy with fewer queries.

Findings

01. DKPS-based methods match baseline accuracy with fewer queries

02. Theoretical proof of query efficiency under certain conditions

03. Offline query selection improves prediction accuracy

Abstract

Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice, responses from previously evaluated models are often cached, which creates an opportunity to use this additional information to decrease the number of queries required to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance that leverages cached model responses based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods achieve the same mean absolute error as baselines…
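
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each response has already been mapped to a fixed-length embedding vector, that models are compared by the Frobenius distance between their embedded responses, that model coordinates come from classical multidimensional scaling, and that benchmark scores are predicted by ordinary least squares on those coordinates. The function names `dkps_embed` and `predict_score`, the distance normalization, and the choice of regressor are all hypothetical.

```python
import numpy as np


def dkps_embed(responses, n_dims=2):
    """Map each model to a low-dimensional point (illustrative, assumed recipe).

    responses: array of shape (n_models, n_queries, embed_dim) holding one
    embedded response per model and query.
    Returns an array of shape (n_models, n_dims).
    """
    n_models, n_queries = responses.shape[0], responses.shape[1]
    flat = responses.reshape(n_models, -1)
    # Pairwise Frobenius distance between models, scaled by sqrt(n_queries)
    # (the scaling is an assumption made for this sketch).
    diff = flat[:, None, :] - flat[None, :, :]
    dist = np.linalg.norm(diff, axis=-1) / np.sqrt(n_queries)
    # Classical multidimensional scaling: double-center the squared distances
    # and keep the leading eigenvectors.
    centering = np.eye(n_models) - np.ones((n_models, n_models)) / n_models
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


def predict_score(cached, cached_scores, new_responses, query_idx, n_dims=2):
    """Predict a new model's benchmark score from a subset of queries.

    cached: (n_models, n_queries, embed_dim) cached embedded responses.
    cached_scores: (n_models,) known benchmark scores of the cached models.
    new_responses: (len(query_idx), embed_dim) embedded responses of the new
    model, generated only for the queries in query_idx.
    """
    # Embed cached models and the new model jointly, using only query_idx.
    stacked = np.concatenate([cached[:, query_idx], new_responses[None]], axis=0)
    coords = dkps_embed(stacked, n_dims)
    # Fit a linear map from coordinates to scores on the cached models.
    design = np.column_stack([np.ones(len(cached)), coords[:-1]])
    beta, *_ = np.linalg.lstsq(design, cached_scores, rcond=None)
    # Apply the fitted map to the new model's coordinates.
    return float(np.concatenate([[1.0], coords[-1]]) @ beta)
```

In this sketch the new model is queried only on `query_idx`, while the cached models contribute both their responses on those queries and their full benchmark scores. How the query subset should be chosen is outside the scope of the example.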
