A Case for Dataset Specific Profiling
Seth Ockerman, John Wu, Christopher Stewart

TL;DR
This paper advocates dataset-specific profiling in scientific data analysis to improve model selection. It argues that benchmarking on representative datasets can mislead, because model performance depends on each dataset's own characteristics.
Contribution
The paper introduces the concept of dataset-specific profiling and demonstrates that it matters for ranking models accurately on scientific datasets, challenging the reliance on representative datasets.
Findings
Dataset characteristics can significantly alter model rankings.
Lightweight model execution can enhance benchmarking accuracy.
Current benchmarking approaches may bias model selection.
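To make the first finding concrete, the sketch below compares how the same models rank on a representative dataset versus a small sample of a target dataset. It is a minimal illustration, not the paper's method: the model names, the accuracy values, and the helper functions `rank_models` and `kendall_tau` are all hypothetical and invented for this example.

```python
from itertools import combinations


def rank_models(scores):
    """Return model names ordered best-first by score (higher is better)."""
    return sorted(scores, key=lambda m: scores[m], reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts over the same models.

    Assumes no tied scores. Returns +1.0 when both datasets order the
    models identically and -1.0 when the order is fully reversed.
    """
    models = list(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies, invented for illustration only.
representative = {"model_a": 0.91, "model_b": 0.88, "model_c": 0.85}
target_sample = {"model_a": 0.79, "model_b": 0.84, "model_c": 0.86}

rep_ranking = rank_models(representative)
target_ranking = rank_models(target_sample)
tau = kendall_tau(representative, target_sample)

print("Ranking on representative dataset:", rep_ranking)
print("Ranking on target dataset sample: ", target_ranking)
print("Kendall tau between rankings:     ", tau)
```

A tau near +1 would suggest the representative dataset is a reasonable proxy for the target; a low or negative value signals that selecting a model from the representative ranking alone could pick a subpar model for the target dataset.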
Abstract
Data-driven science is an emerging paradigm where scientific discoveries depend on the execution of computational AI models against rich, discipline-specific datasets. With modern machine learning frameworks, anyone can develop and execute computational models that reveal concepts hidden in the data that could enable scientific applications. For important and widely used datasets, computing the performance of every computational model that can run against a dataset is cost prohibitive in terms of cloud resources. Benchmarking approaches used in practice use representative datasets to infer performance without actually executing models. While practicable, these approaches limit extensive dataset profiling to a few datasets and introduce bias that favors models suited for representative datasets. As a result, each dataset's unique characteristics are left unexplored and subpar models are…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Machine Learning in Materials Science
