Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts
Jingyu Zhang, Fan Wang, Jacky Keung, Yihan Liao, Yan Xiao, Lei Ma

TL;DR
This paper provides a comprehensive empirical evaluation of 15 test selection metrics across multiple testing objectives, OOD scenarios, data modalities, and models, addressing gaps in prior research.
Contribution
It introduces a large-scale benchmark and analysis framework to assess metric effectiveness under diverse, realistic testing conditions for DL systems.
Findings
Metrics vary significantly in effectiveness across scenarios
Certain metrics perform well for fault detection but poorly for performance estimation
The study highlights the importance of context-specific metric selection
Abstract
Deep learning (DL)-based systems can exhibit unexpected behavior when exposed to out-of-distribution (OOD) scenarios, posing serious risks in safety-critical domains such as malware detection and autonomous driving. This underscores the importance of thoroughly testing such systems before deployment. To this end, researchers have proposed a wide range of test selection metrics designed to effectively select inputs. However, prior evaluations of these metrics have three key limitations: (1) narrow testing objectives, for example, many studies assess metrics only for fault detection, leaving their effectiveness for performance estimation unclear; (2) limited coverage of OOD scenarios, with natural and label shifts rarely considered; (3) biased dataset selection, where most work focuses on image data while other modalities remain underexplored. Consequently, a unified benchmark that…
