Methods for Quantifying Dataset Similarity: a Review, Taxonomy and Comparison
Marieke Stolte, Franziska Kappenberg, Jörg Rahnenführer, Andrea Bommert

TL;DR
This paper reviews and classifies over 100 methods for quantifying dataset similarity, providing a comprehensive taxonomy, comparison, and practical recommendations for selecting appropriate measures in various applications.
Contribution
It offers the first extensive taxonomy and comparison of dataset similarity measures, including an online tool for method selection.
Findings
Classified 118 methods into 10 categories
Compared methods based on applicability, interpretability, and properties
Provided guidelines for choosing similarity measures
Abstract
Quantifying the similarity between datasets has widespread applications in statistics and machine learning. The performance of a predictive model on novel datasets, referred to as generalizability, depends on how similar the training and evaluation datasets are. Exploiting or transferring insights between similar datasets is a key aspect of meta-learning and transfer learning. In simulation studies, it is crucial that the distributions of the simulated datasets on which methods are assessed resemble those of real datasets. In two- or k-sample testing, one checks whether the underlying distributions of two or more datasets coincide. A great many approaches for quantifying dataset similarity have been proposed in the literature. We examine more than 100 methods and provide a taxonomy, classifying them into ten classes. In an extensive review of these methods the…
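To make the two-sample testing setting mentioned in the abstract concrete, here is a minimal sketch of a permutation test built on the energy distance between two multivariate samples. The choice of energy distance, the function names, the sample sizes, and the synthetic Gaussian data are all assumptions made for illustration; none of them is taken from this summary, and the sketch does not represent the authors' recommended method.

```python
import math
import random


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (x, y) with x in a, y in b."""
    total = sum(math.dist(x, y) for x in a for y in b)
    return total / (len(a) * len(b))


def energy_distance(a, b):
    """V-statistic estimate of the energy distance between two samples."""
    return (2.0 * mean_pairwise_distance(a, b)
            - mean_pairwise_distance(a, a)
            - mean_pairwise_distance(b, b))


def permutation_p_value(a, b, n_permutations=200, seed=0):
    """Permutation p-value for H0: both samples share one distribution."""
    rng = random.Random(seed)
    observed = energy_distance(a, b)
    pooled = list(a) + list(b)
    exceed = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled points to two groups of the original sizes.
        rng.shuffle(pooled)
        if energy_distance(pooled[:len(a)], pooled[len(a):]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_permutations + 1)


def sample_gaussian(n, mean, rng):
    """Draw n two-dimensional points with independent unit-variance coordinates."""
    return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(n)]


# Hypothetical data: a reference dataset and a clearly shifted one.
rng = random.Random(42)
reference = sample_gaussian(40, 0.0, rng)
shifted = sample_gaussian(40, 3.0, rng)

p_shifted = permutation_p_value(reference, shifted)
print(f"energy distance: {energy_distance(reference, shifted):.3f}")
print(f"permutation p-value: {p_shifted:.4f}")
```

A small p-value suggests that the two datasets were not drawn from the same distribution. The permutation approach is used here because it needs no distributional assumptions, at the cost of recomputing the statistic for every reshuffle.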
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Statistical Methods and Models · Statistical Methods and Inference
