Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks

Guanhua Zhang; Moritz Hardt

arXiv:2405.01719 · cs.LG · May 7, 2024

TL;DR

This paper examines the inherent trade-off between diversity and stability in multi-task machine learning benchmarks, showing that more diverse benchmarks are more sensitive to irrelevant changes, with implications for benchmark design.

Contribution

It introduces a novel theoretical framework based on social choice theory, along with new quantitative measures and algorithms, to analyze and empirically demonstrate the diversity-stability trade-off in multi-task benchmarks.

Findings

1. A strong trade-off between diversity and stability: the more diverse a benchmark, the more sensitive it is to irrelevant changes.

2. Existing multi-task benchmarks are highly unstable under irrelevant task changes.

3. New quantitative measures and algorithms for assessing diversity and sensitivity.

Abstract

We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction between cardinal and ordinal benchmark systems. The former aggregate numerical scores into one model ranking; the latter aggregate rankings for each task. We apply Arrow's impossibility theorem to ordinal benchmarks to highlight the inherent limitations of ordinal systems, particularly their sensitivity to the inclusion of irrelevant models. Inspired by Arrow's theorem, we empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. Sensitivity quantifies the impact that irrelevant changes…
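The cardinal/ordinal distinction and the sensitivity to irrelevant models can be illustrated with a toy example. The sketch below is not the authors' code and does not use the benchbench API: the model names, the scores, the choice of mean score as the cardinal rule, and the choice of a Borda-style count as the ordinal rule are all assumptions made for illustration.

```python
# Toy scores (hypothetical): three models evaluated on five tasks.
scores = {
    "A": [0.9, 0.9, 0.9, 0.7, 0.7],
    "B": [0.8, 0.8, 0.8, 0.9, 0.9],
    "C": [0.7, 0.7, 0.7, 0.8, 0.8],
}


def cardinal_ranking(scores):
    """Cardinal rule (assumed here: mean score): aggregate the numbers, then rank."""
    means = {m: sum(s) / len(s) for m, s in scores.items()}
    return sorted(means, key=lambda m: (-means[m], m))


def ordinal_ranking(scores):
    """Ordinal rule (assumed here: Borda-style count): rank per task, then aggregate.

    On each task a model earns one point for every other model it strictly beats.
    """
    models = list(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {m: 0 for m in models}
    for t in range(n_tasks):
        for m in models:
            points[m] += sum(
                scores[m][t] > scores[o][t] for o in models if o != m
            )
    return sorted(points, key=lambda m: (-points[m], m))


# Drop model C, which is never ranked first by either rule.
reduced = {m: s for m, s in scores.items() if m != "C"}

print("ordinal, all models: ", ordinal_ranking(scores))
print("ordinal, without C:  ", ordinal_ranking(reduced))
print("cardinal, all models:", cardinal_ranking(scores))
print("cardinal, without C: ", cardinal_ranking(reduced))
```

With these numbers, the ordinal rule places B ahead of A when C is present and A ahead of B once C is removed, even though no score of A or B changed. The mean-score rule keeps B ahead of A in both cases. This shows only the kind of sensitivity to irrelevant models that the abstract attributes to ordinal systems; it says nothing about how often such flips occur in real benchmarks.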

Code & Models

Repositories

socialfoundations/benchbench (official)

Taxonomy

Topics: Reinforcement Learning in Robotics