Geometric Dataset Distances via Optimal Transport
David Alvarez-Melis, Nicolò Fusi

TL;DR
This paper introduces a model-agnostic, training-free distance between datasets based on optimal transport. It can compare datasets even when their label sets are disjoint, rests on well-understood theory, and correlates with transfer learning difficulty.
Contribution
It proposes a novel dataset distance using optimal transport that is model-agnostic, training-free, and applicable to disjoint label sets, with solid theoretical support.
Findings
The distance correlates well with transfer learning difficulty.
It provides meaningful comparisons across diverse datasets.
Because it builds on optimal transport, the distance also yields interpretable correspondences between datasets and inherits well-understood properties.
Abstract
The notion of task similarity is at the core of various machine learning paradigms, such as domain adaptation and meta-learning. Current methods to quantify it are often heuristic, make strong assumptions on the label sets across the tasks, and many are architecture-dependent, relying on task-specific optimal parameters (e.g., require training a model on each dataset). In this work we propose an alternative notion of distance between datasets that (i) is model-agnostic, (ii) does not involve training, (iii) can compare datasets even if their label sets are completely disjoint and (iv) has solid theoretical footing. This distance relies on optimal transport, which provides it with rich geometry awareness, interpretable correspondences and well-understood properties. Our results show that this novel distance provides meaningful comparison of datasets, and correlates well with transfer…
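The abstract does not spell out the construction, so the following is only a minimal sketch of how an optimal transport distance over labeled datasets could be computed, not the authors' implementation. It assumes a ground cost that adds the squared Euclidean distance between features to a squared 2-Wasserstein distance between Gaussian fits of the two classes involved, which is what lets it compare datasets with disjoint label sets. The function names `gaussian_w2_squared` and `dataset_distance` are hypothetical, and the transport problem is solved exactly with a generic linear program, which is only practical for small samples.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import linprog


def gaussian_w2_squared(a, b):
    """Squared 2-Wasserstein distance between Gaussian fits of two samples.

    Assumption of this sketch: each class is summarized by its mean and
    covariance, so each class needs at least two samples.
    """
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(b, rowvar=False))
    root_a = np.real(sqrtm(cov_a))
    cross = np.real(sqrtm(root_a @ cov_b @ root_a))
    bures = np.trace(cov_a + cov_b - 2.0 * cross)
    # Clip tiny negative values caused by round-off in the matrix square roots.
    return float(np.sum((mean_a - mean_b) ** 2) + max(bures, 0.0))


def dataset_distance(xa, ya, xb, yb):
    """Hypothetical OT distance between labeled datasets (xa, ya) and (xb, yb)."""
    labels_a, labels_b = np.unique(ya), np.unique(yb)

    # Label-to-label cost: distance between the class-conditional distributions.
    label_cost = np.array(
        [[gaussian_w2_squared(xa[ya == la], xb[yb == lb]) for lb in labels_b]
         for la in labels_a]
    )
    ia = np.searchsorted(labels_a, ya)  # index of each sample's label
    ib = np.searchsorted(labels_b, yb)

    # Pairwise cost between samples: feature term plus label term.
    feature_cost = ((xa[:, None, :] - xb[None, :, :]) ** 2).sum(axis=2)
    cost = feature_cost + label_cost[ia][:, ib]

    # Exact OT as a linear program over the coupling, with uniform marginals.
    n, m = cost.shape
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0  # row i of the coupling sums to 1/n
    for j in range(m):
        a_eq[n + j, j::m] = 1.0           # column j of the coupling sums to 1/m
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    result = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                     bounds=(0, None), method="highs")
    return float(np.sqrt(max(result.fun, 0.0)))
```

No model is trained at any point: the only inputs are the samples and their labels, and the labels enter only through the class-conditional distributions, so they never need to match across datasets. For realistic dataset sizes, an entropy-regularized solver would be a more practical choice than the dense linear program used here.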
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Advanced Neural Network Applications
