An empirical study of testing machine learning in the wild
Moses Openja, Foutse Khomh, Armstrong Foundjem, Zhen Ming (Jack) Jiang, Mouna Abidi, Ahmed E. Hassan

TL;DR
This empirical study investigates real-world testing practices of machine learning systems, revealing common strategies, properties tested, and how testing varies across different ML applications and domains.
Contribution
First detailed empirical analysis of ML testing practices in open-source projects, highlighting prevalent strategies, properties tested, and domain-specific testing trends.
Findings
Grey-box and white-box testing are the most common testing strategies.
Only 20-30% of ML properties are frequently tested.
Bias and Fairness are tested more in Recommendation systems.
Abstract
Recently, machine and deep learning (ML/DL) algorithms have been increasingly adopted in many software systems. Due to their inductive nature, ensuring the quality of these systems remains a significant challenge for the research community. Unlike traditional software built deductively by writing explicit rules, ML/DL systems infer rules from training data. Recent research in ML/DL quality assurance has adapted concepts from traditional software testing, such as mutation testing, to improve reliability. However, it is unclear if these proposed testing techniques are adopted in practice, or if new testing strategies have emerged from real-world ML deployments. There is little empirical evidence about the testing strategies. To fill this gap, we perform the first fine-grained empirical study on ML testing in the wild to identify the ML properties being tested, the testing strategies,…
Taxonomy
Topics: Software Testing and Debugging Techniques · Software System Performance and Reliability · Software Engineering Research
