Methods for Quantifying Dataset Similarity: a Review, Taxonomy and Comparison

Marieke Stolte; Franziska Kappenberg; Jörg Rahnenführer; Andrea Bommert

arXiv:2312.04078·stat.ME·June 18, 2025·2 cites

Open Access

TL;DR

This paper reviews and classifies over 100 methods for quantifying dataset similarity, providing a comprehensive taxonomy, comparison, and practical recommendations for selecting appropriate measures in various applications.

Contribution

It offers the first extensive taxonomy and comparison of dataset similarity measures, including an online tool for method selection.

Findings

01 Classified 118 methods into 10 categories

02 Compared methods based on applicability, interpretability, and properties

03 Provided guidelines for choosing similarity measures

Abstract

Quantifying the similarity between datasets has widespread applications in statistics and machine learning. The performance of a predictive model on novel datasets, referred to as generalizability, depends on how similar the training and evaluation datasets are. Exploiting or transferring insights between similar datasets is a key aspect of meta-learning and transfer learning. In simulation studies, the similarity between distributions of simulated datasets and real datasets, for which the performance of methods is assessed, is crucial. In two- or $k$-sample testing, one checks whether the underlying distributions of two or more datasets coincide. A great many approaches for quantifying dataset similarity have been proposed in the literature. We examine more than 100 methods and provide a taxonomy, classifying them into ten classes. In an extensive review of these methods the…
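As a concrete illustration of the two-sample testing setting mentioned in the abstract, the sketch below compares two univariate samples with a permutation test. The Kolmogorov-Smirnov statistic is used only as a familiar example of a similarity measure; it is an assumption of this sketch, not a recommendation taken from the paper, and the function names `ks_statistic` and `permutation_p_value` are hypothetical.

```python
import random


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the empirical CDFs of x and y."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Advance both samples past every observation <= v (handles ties).
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def permutation_p_value(x, y, n_perm=999, seed=0):
    """Permutation p-value for H0: both samples come from the same distribution.
    Group labels are reshuffled n_perm times and the statistic is recomputed."""
    rng = random.Random(seed)
    observed = ks_statistic(x, y)
    pooled = list(x) + list(y)
    n = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n], pooled[n:]) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    a = [rng.gauss(0.0, 1.0) for _ in range(50)]
    b = [rng.gauss(1.0, 1.0) for _ in range(50)]  # shifted mean (illustrative data)
    print(ks_statistic(a, b), permutation_p_value(a, b))
```

A small p-value suggests the two samples were not drawn from the same distribution. The statistic can be swapped for another similarity measure without changing the permutation scheme, which is one reason permutation tests are convenient in this setting.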

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Statistical Methods and Models · Statistical Methods and Inference