On the Assessment of Benchmark Suites for Algorithm Comparison
David Issa Mattos, Lucas Ruud, Jan Bosch, Helena Holmström Olsson

TL;DR
This paper introduces a statistical method based on item response theory to evaluate the effectiveness of benchmark suites in algorithm comparison, focusing on difficulty, discrimination, and informativeness.
Contribution
It applies a Bayesian IRT model to assess benchmark functions, providing a new way to evaluate and improve benchmark suite design for algorithm testing.
Findings
BBOB functions are generally difficult and discriminate poorly between algorithms.
PBO functions are easier and discriminate better between algorithms.
IRT can guide the development of more effective benchmark suites.
Abstract
Benchmark suites, i.e. collections of benchmark functions, are widely used in the comparison of black-box optimization algorithms. Over the years, research has identified many desired qualities for benchmark suites, such as diverse topology, varying difficulty, scalability, and representativeness of real-world problems, among others. However, while the topology characteristics have been the subject of previous studies, no study has statistically evaluated the difficulty level of benchmark functions, how well they discriminate between optimization algorithms, and how suitable a benchmark suite is for algorithm comparison. In this paper, we propose the use of an item response theory (IRT) model, the Bayesian two-parameter logistic model for multiple attempts, to statistically evaluate these aspects with respect to the empirical success rate of algorithms. With this model, we can assess…
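The abstract names the model but this excerpt does not give its parametrization. As a rough illustration only, the sketch below assumes the textbook two-parameter logistic (2PL) form, with each benchmark function treated as an item (difficulty b, discrimination a) and each algorithm as a respondent with ability theta, and it assumes that "multiple attempts" means a binomial count of successful runs. The function and parameter names are hypothetical, and the Bayesian part (priors and posterior sampling) is omitted.

```python
import math


def success_probability(ability, difficulty, discrimination):
    """Assumed 2PL response function: P(success) = logistic(a * (theta - b))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def binomial_log_likelihood(successes, attempts, p):
    """Log-likelihood of `successes` out of `attempts` independent runs,
    each succeeding with probability p (assumed multiple-attempts likelihood)."""
    log_choose = (
        math.lgamma(attempts + 1)
        - math.lgamma(successes + 1)
        - math.lgamma(attempts - successes + 1)
    )
    return (
        log_choose
        + successes * math.log(p)
        + (attempts - successes) * math.log(1.0 - p)
    )


def item_information(ability, difficulty, discrimination):
    """Fisher information of a 2PL item at a given ability: a^2 * p * (1 - p)."""
    p = success_probability(ability, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)


# Hypothetical values: one algorithm (ability 0.5) on one benchmark function.
p = success_probability(ability=0.5, difficulty=1.0, discrimination=2.0)
loglik = binomial_log_likelihood(successes=3, attempts=10, p=p)
info = item_information(ability=0.5, difficulty=1.0, discrimination=2.0)
```

Under this assumed form, a function is most informative for algorithms whose ability is close to its difficulty, and a larger discrimination makes the success probability change more sharply around that point.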
Taxonomy
Topics: Advanced Multi-Objective Optimization Algorithms · Machine Learning and Data Classification · Sports Analytics and Performance
