Good practices for evaluation of machine learning systems
Luciana Ferrer, Odette Scharenborg, Tom Bäckström

TL;DR
This paper argues that evaluation protocols in machine learning must be designed carefully, covering data selection, metric choice, and statistical significance, so that results are reliable and generalize to unseen data.
Contribution
It provides guidelines and examples for designing effective evaluation procedures in ML, highlighting common pitfalls and best practices.
Findings
Proper evaluation design prevents misleading conclusions
Careful data and metric selection improves result reliability
Statistical significance assessment is crucial for valid comparisons
Abstract
Many development decisions affect the results obtained from ML experiments: training data, features, model architecture, hyperparameters, test data, etc. Among these aspects, arguably the most important design decisions are those that involve the evaluation procedure. This procedure is what determines whether the conclusions drawn from the experiments will or will not generalize to unseen data and whether they will be relevant to the application of interest. If the data is incorrectly selected, the wrong metric is chosen for evaluation or the significance of the comparisons between models is overestimated, conclusions may be misleading or result in suboptimal development decisions. To avoid such problems, the evaluation protocol should be very carefully designed before experimentation starts. In this work we discuss the main aspects involved in the design of the evaluation protocol:…
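One way to avoid overestimating the significance of a comparison between two models is to put a confidence interval on the difference in their metric. The sketch below is an illustration only, not necessarily the procedure the authors recommend: it uses a paired percentile bootstrap over test samples, with accuracy as the metric. The function name, its parameters, and the toy data are all hypothetical.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on a shared test set.

    correct_a, correct_b: per-sample 0/1 correctness of two systems, aligned so
    that index i refers to the same test sample in both lists (an assumption of
    this sketch).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, aligned lists")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        # Resample test-sample indices with replacement and reuse the same
        # indices for both systems, so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Toy data (made up): per-sample correctness of two systems on 10 samples.
    system_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
    system_b = [1, 0, 1, 0, 1, 1, 0, 0, 1, 1]
    print(paired_bootstrap_ci(system_a, system_b))
```

If the resulting interval contains zero, the observed difference may not hold up on new data. The sketch treats test samples as independent; when samples share a source (for example the same speaker or document), resampling whole groups rather than individual samples is likely the safer choice.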
Taxonomy
Topics: Advanced Data Processing Techniques · Neural Networks and Applications
