Over-optimism in benchmark studies and the multiplicity of design and analysis options when interpreting their results

Christina Nießl (1), Moritz Herrmann (2), Chiara Wiedemann (1), Giuseppe Casalicchio (2), Anne-Laure Boulesteix (1) ((1) Institute for Medical Information Processing, Biometry, and Epidemiology, LMU Munich, Germany; (2) Department of Statistics, LMU Munich, Germany)

arXiv:2106.02447·stat.ME·January 17, 2024·WIREs Data Mining Knowl. Discov.

TL;DR

This paper highlights how flexible choices in benchmark study design and analysis can lead to biased, overly optimistic results, emphasizing the need for careful, transparent research practices.

Contribution

It demonstrates the impact of multiple design and analysis options on benchmark results and advocates for awareness and transparency to improve reliability.

Findings

1. Benchmark results vary significantly with different design choices.

2. Multidimensional unfolding can assess the impact of analysis options.

3. Questionable practices can bias interpretations of benchmark studies.

Abstract

In recent years, the need for neutral benchmark studies that focus on the comparison of methods from computational sciences has been increasingly recognised by the scientific community. While general advice on the design and analysis of neutral benchmark studies can be found in recent literature, certain amounts of flexibility always exist. This includes the choice of data sets and performance measures, the handling of missing performance values and the way the performance values are aggregated over the data sets. As a consequence of this flexibility, researchers may be concerned about how their choices affect the results or, in the worst case, may be tempted to engage in questionable research practices (e.g. the selective reporting of results or the post-hoc modification of design or analysis components) to fit their expectations or hopes. To raise awareness for this issue, we use an…
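The flexibility named in the abstract (handling of missing performance values, aggregation over data sets) can be made concrete with a small sketch. It is not the authors' analysis: the error rates, the method names, the two missing-value strategies (`drop_dataset`, `impute_penalty`) and the three aggregations are all hypothetical choices made here for illustration. The sketch simply reports which method comes out best under each combination.

```python
from itertools import product
from statistics import mean, median

# Hypothetical error rates (lower is better) for three methods on five
# data sets; None marks a failed run. Invented numbers, not from the paper.
ERRORS = {
    "method_A": [0.10, 0.12, 0.11, 0.60, None],
    "method_B": [0.15, 0.16, 0.14, 0.20, 0.18],
    "method_C": [0.13, 0.11, 0.17, 0.25, 0.16],
}


def handle_missing(errors, strategy):
    """Return a complete table under one (assumed) missing-value strategy."""
    n = len(next(iter(errors.values())))
    if strategy == "drop_dataset":
        # Keep only data sets on which every method produced a value.
        keep = [j for j in range(n)
                if all(vals[j] is not None for vals in errors.values())]
        return {m: [vals[j] for j in keep] for m, vals in errors.items()}
    if strategy == "impute_penalty":
        # Replace a failed run by the worst possible error rate, 1.0.
        return {m: [1.0 if v is None else v for v in vals]
                for m, vals in errors.items()}
    raise ValueError(f"unknown strategy: {strategy}")


def mean_rank(table):
    """Average per-data-set rank of each method (1 = lowest error)."""
    methods = list(table)
    n = len(table[methods[0]])
    ranks = {m: [] for m in methods}
    for j in range(n):
        ordered = sorted(methods, key=lambda m: table[m][j])
        for rank, m in enumerate(ordered, start=1):
            ranks[m].append(rank)
    return {m: mean(r) for m, r in ranks.items()}


AGGREGATORS = {
    "mean": lambda t: {m: mean(v) for m, v in t.items()},
    "median": lambda t: {m: median(v) for m, v in t.items()},
    "mean_rank": mean_rank,
}


def winners(errors):
    """Best method for every (missing-value strategy, aggregation) pair."""
    result = {}
    strategies = ("drop_dataset", "impute_penalty")
    for strategy, (name, aggregate) in product(strategies, AGGREGATORS.items()):
        scores = aggregate(handle_missing(errors, strategy))
        result[(strategy, name)] = min(scores, key=scores.get)
    return result


if __name__ == "__main__":
    for combo, best in winners(ERRORS).items():
        print(combo, "->", best)
```

With these invented numbers, each of the three methods is ranked first under at least one combination, which is the kind of result-dependence on analysis choices that the abstract warns about. A selective report of only one combination would therefore support whichever method the analyst preferred.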

Code & Models

Repositories

NiesslC/overoptimism_benchmark (official)