A Theoretical Framework for Adaptive Utility-Weighted Benchmarking
Philip Waggoner

TL;DR
This paper proposes a theoretical framework for adaptive, stakeholder-aware benchmarking in AI, enabling dynamic, context-sensitive evaluation that incorporates human preferences and evolves over time.
Contribution
It introduces a multilayer, adaptive network model for benchmarking that integrates stakeholder priorities and allows for dynamic evolution of evaluation metrics.
Findings
Formalizes human-in-the-loop benchmark updates
Generalizes classical leaderboards as a special case
Provides tools for analyzing benchmark structural properties
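As a rough illustration of how a classical leaderboard can fall out as a special case of stakeholder-weighted scoring, here is a minimal sketch. It is not the paper's formalism: the weighted-sum scoring rule, the blending update, and every name in it (utility_score, update_weights, importance, rate) are assumptions made only for this example.

```python
from typing import Dict

Weights = Dict[str, float]


def normalize(weights: Weights) -> Weights:
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {k: v / total for k, v in weights.items()}


def utility_score(
    metrics: Dict[str, float],
    stakeholder_weights: Dict[str, Weights],
    importance: Weights,
) -> float:
    """Hypothetical score: importance-weighted average over stakeholders
    of each stakeholder's weighted average over metrics."""
    importance = normalize(importance)
    score = 0.0
    for stakeholder, weights in stakeholder_weights.items():
        w = normalize(weights)
        score += importance[stakeholder] * sum(w[m] * metrics[m] for m in w)
    return score


def update_weights(current: Weights, feedback: Weights, rate: float = 0.5) -> Weights:
    """Hypothetical human-in-the-loop update: blend the current metric
    weights with weights elicited from stakeholder feedback."""
    keys = set(current) | set(feedback)
    blended = {
        k: (1 - rate) * current.get(k, 0.0) + rate * feedback.get(k, 0.0)
        for k in keys
    }
    return normalize(blended)


# Illustrative metric values for one model (made up for this sketch).
metrics = {"accuracy": 0.5, "robustness": 0.25}

# Classical leaderboard: one stakeholder, all weight on a single metric.
leaderboard = utility_score(metrics, {"all": {"accuracy": 1.0}}, {"all": 1.0})

# Two stakeholders with different priorities, equally important.
blended = utility_score(
    metrics,
    {"developer": {"accuracy": 1.0}, "auditor": {"robustness": 1.0}},
    {"developer": 1.0, "auditor": 1.0},
)
```

With a single stakeholder who cares about one metric, the score reduces to that metric's value, which is the sense in which a fixed leaderboard is a degenerate case here. Changing the weights over time, as in update_weights, changes the ranking criterion without changing the underlying measurements.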
Abstract
Benchmarking has long served as a foundational practice in machine learning and, increasingly, in modern AI systems such as large language models, where shared tasks, metrics, and leaderboards offer a common basis for measuring progress and comparing approaches. As AI systems are deployed in more varied and consequential settings, though, there is growing value in complementing these established practices with a more holistic conceptualization of what evaluation should represent. Of note, recognizing the sociotechnical contexts in which these systems operate invites an opportunity for a deeper view of how multiple stakeholders and their unique priorities might inform what we consider meaningful or desirable model behavior. This paper introduces a theoretical framework that reconceptualizes benchmarking as a multilayer, adaptive network linking evaluation metrics, model components, and…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning
