What should an AI assessor optimise for?

Daniel Romero-Alvarado; Fernando Martínez-Plumed; José Hernández-Orallo

arXiv:2502.00365·cs.LG·February 4, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates whether training an AI assessor directly on the target metric is always optimal, or whether training on a different metric and mapping the predictions back to the target metric can give better predictions.

Contribution

It experimentally explores the impact of training assessors on different metrics and shows that optimizing for more informative metrics is not always advantageous.

Findings

01. Optimizing for the target metric is not always best.

02. Some monotonic transformations improve assessor performance.

03. Logarithmic scores benefit classification score maximization.

Abstract

An AI assessor is an external, ideally independent system that predicts an indicator, e.g., a loss value, of another AI system. Assessors can leverage information from the test results of many other AI systems and have the flexibility of being trained on any loss function or scoring rule: from squared error to toxicity metrics. Here we address the question: is it always optimal to train the assessor for the target metric? Or could it be better to train for a different metric and then map predictions back to the target metric? Using twenty regression and classification problems with tabular data, we experimentally explore this question for, respectively, regression losses and classification scores with monotonic and non-monotonic mappings and find that, contrary to intuition, optimising for more informative metrics is not generally better. Surprisingly, some monotonic…
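The two routes the abstract contrasts (train on the target metric directly, or train on a different metric and map the predictions back) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' experimental setup: the data are synthetic, the assessor is a plain least-squares linear model, the target metric is taken to be absolute error, and the proxy metric is signed error mapped back with `abs`. The names `fit_linear`, `predict_linear`, and `compare_routes` are hypothetical.

```python
import numpy as np


def fit_linear(X, t):
    """Fit a least-squares linear assessor (with bias) to targets t."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, t, rcond=None)
    return w


def predict_linear(w, X):
    """Predict with a linear assessor fitted by fit_linear."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w


def compare_routes(seed=0, n=2000):
    """Compare direct training against proxy training plus mapping back."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))

    # ASSUMPTION: synthetic signed error of some base system per instance.
    signed_err = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)
    abs_err = np.abs(signed_err)  # target metric (assumed for illustration)

    train, test = slice(0, n // 2), slice(n // 2, n)

    # Route A: train the assessor directly on the target metric.
    w_direct = fit_linear(X[train], abs_err[train])
    pred_direct = predict_linear(w_direct, X[test])

    # Route B: train on the proxy metric (signed error), then map the
    # predictions back to the target metric with a non-monotonic map (abs).
    w_proxy = fit_linear(X[train], signed_err[train])
    pred_mapped = np.abs(predict_linear(w_proxy, X[test]))

    def target_mae(pred):
        # Both routes are scored on the same target metric.
        return float(np.mean(np.abs(pred - abs_err[test])))

    return {"direct": target_mae(pred_direct),
            "proxy_mapped": target_mae(pred_mapped)}


if __name__ == "__main__":
    print(compare_routes())
```

In this toy setting the signed error is linear in the features by construction, so the proxy route is favoured; the outcome says nothing about the paper's datasets and only shows the mechanics of evaluating both routes on the same target metric.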

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. **Innovative Approach:** The paper addresses a critical gap in the understanding of how assessors can be trained, providing a fresh perspective on the relationship between the optimization of assessor models and the performance metrics of interest.
2. **Comprehensive Experimental Design:** The authors conduct several experiments that systematically evaluate the impact of different training metrics on assessor performance, leading to insightful findings.
3. **Interesting Findings:** The consis

Weaknesses

1. **Limited Metric Scope:** While the paper examines several regression metrics, it may benefit from a broader exploration of other types of metrics across different applications, such as classification or ranking tasks, to enhance the generalizability of the findings.
2. **Theoretical Foundation:** The paper would be strengthened by a more robust theoretical explanation of why certain metrics outperform others in the context of assessor training, particularly for the surprising results related

Reviewer 02 · Rating 5 · Confidence 4

Strengths

S1: The exploration of which loss functions are most effective for a target loss is generally important and can be informative if the best choices are known and documented for practitioners.
S2: The work uses a variety of tabular models and datasets.

Weaknesses

W1: The work engages only with regression models, which makes the scope somewhat narrow.
W2: The work does show some interesting outcomes, with some of them summarized in Figure 8. The results are, however, explained from an empirical perspective only. There is no theoretical discussion about, for example, how ML optimization and different optimization algorithms may interfere and interact with the results. For example, the fact that signed simple error -> signed squared error is green

Reviewer 03 · Rating 5 · Confidence 3

Strengths

I think the paper has several strengths:
1. The explanations of concepts and methodology are quite clear. I like the use of diagrams such as Figures 1, 3, 4 and 8.
2. The number of datasets used is quite commendable. 10 seems like a sufficient number to generate some useful trends if there was a large effect at play.
3. The subject matter is of interest, although I think the focus on tree-based methods restricts the utility of the paper.

Weaknesses

I do not think this paper is sufficiently novel and broadly applicable to warrant publication in ICLR. I believe there are several weaknesses:
1. Unless I missed something, I believe all models used in the training set for the assessors are tree-based and are not neural networks. I know that tree-based methods can be better for tabular data but the most interesting application of assessors (to me at least) is neural networks. The most interesting assessor models are the ones that produc

Taxonomy

Topics: AI-based Problem Solving and Planning · Explainable Artificial Intelligence (XAI)