Re-evaluating Evaluation
David Balduzzi, Karl Tuyls, Julien Perolat, Thore Graepel

TL;DR
The paper introduces Nash averaging, a game-theoretic evaluation method that automatically adapts to redundant agents and tasks, so that adding easy tasks or weak agents does not bias the results.
Contribution
It proposes Nash averaging, a novel evaluation framework based on game theory, to address biases caused by task and agent redundancies in machine learning assessments.
Findings
Nash averaging automatically adjusts for redundancies in evaluation data.
It encourages inclusive evaluation: because easy tasks and weak agents do not bias the results, they can be added without harm.
The approach is grounded in algebraic analysis of agent-vs-agent and agent-vs-task scenarios.
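The redundancy claim in the findings above can be illustrated with a small agent-vs-agent sketch. It assumes the meta-game is given as an antisymmetric payoff matrix (rock-paper-scissors here, as hypothetical example data) and scores each agent by its expected payoff against an approximate Nash mixture. The paper works with a maximum-entropy Nash equilibrium; this sketch swaps in time-averaged multiplicative-weights self-play, which only approximates some Nash equilibrium. All function and variable names are illustrative and not taken from the authors' code.

```python
import math

# Hypothetical example data: antisymmetric payoffs, A[i][j] = -A[j][i].
# Agent order: rock, paper, scissors.
RPS = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]
# Same game with a redundant second copy of rock:
# rock, rock', paper, scissors.
RPS_DUP = [
    [0, 0, -1, 1],
    [0, 0, -1, 1],
    [1, 1, 0, -1],
    [-1, -1, 1, 0],
]


def matvec(A, p):
    """Expected payoff of each row agent against the mixture p."""
    return [sum(a * x for a, x in zip(row, p)) for row in A]


def uniform_average(A):
    """Baseline: average payoff against all agents, weighted equally."""
    n = len(A)
    return matvec(A, [1.0 / n] * n)


def approx_nash(A, steps=20000, eta=0.01):
    """Approximate a symmetric Nash mixture of the zero-sum game A.

    Uses multiplicative-weights self-play and returns the time-averaged
    mixture. This is a stand-in, not a maximum-entropy Nash solver.
    """
    n = len(A)
    p = [1.0 / n] * n
    running_sum = [0.0] * n
    for _ in range(steps):
        payoff = matvec(A, p)
        weights = [x * math.exp(eta * u) for x, u in zip(p, payoff)]
        total = sum(weights)
        p = [w / total for w in weights]
        running_sum = [s + x for s, x in zip(running_sum, p)]
    return [s / steps for s in running_sum]


def nash_average(A):
    """Score each agent by its payoff against the approximate Nash mixture."""
    return matvec(A, approx_nash(A))


if __name__ == "__main__":
    print("uniform, with duplicate rock:", uniform_average(RPS_DUP))
    print("Nash-style, with duplicate rock:", nash_average(RPS_DUP))
```

With the duplicated rock agent, uniform averaging scores paper at 0.25 and scissors at -0.25, even though the underlying game has not changed. The Nash-style scores stay close to zero for every agent, because the approximate equilibrium splits roughly one third of its mass across the two rock copies instead of counting rock twice.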
Abstract
Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation -- since…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Anomaly Detection Techniques and Applications
