Generating Synthetic Datasets by Interpolating along Generalized Geodesics

Jiaojiao Fan; David Alvarez-Melis

arXiv:2306.06866 · cs.LG · June 13, 2023 · 2 cites

Open Access

TL;DR

This paper introduces a novel method for synthesizing new datasets by interpolating existing datasets along generalized geodesics in optimal transport space, enabling targeted pretraining for transfer learning.

Contribution

It extends dataset synthesis to all possible combinations of datasets using generalized geodesics, including those with different label sets, via scalable optimal transport methods.

Findings

01. Effective dataset interpolation between unrelated label sets.

02. Improved transfer learning performance in computer vision tasks.

03. Scalable and efficient synthesis of targeted datasets.

Abstract

Data for pretraining machine learning models often consists of collections of heterogeneous datasets. Although training on their union is reasonable in agnostic settings, it might be suboptimal when the target domain, where the model will ultimately be used, is known in advance. In that case, one would ideally pretrain only on the dataset(s) most similar to the target one. Instead of limiting this choice to those datasets already present in the pretraining collection, here we explore extending this search to all datasets that can be synthesized as 'combinations' of them. We define such combinations as multi-dataset interpolations, formalized through the notion of generalized geodesics from optimal transport (OT) theory. We compute these geodesics using a recent notion of distance between labeled datasets, and derive alternative interpolation schemes based on it: using either…
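
To make the idea of a generalized geodesic concrete, here is a minimal toy sketch, not the paper's implementation. It assumes unlabeled point clouds of equal size with uniform weights and a squared Euclidean cost, whereas the paper works with a distance between labeled datasets. Under those assumptions, the OT map from a base cloud to each dataset is an optimal one-to-one matching, and the interpolated dataset is obtained by pushing each base point through the weighted average of those maps. The names `sq_dist`, `ot_map`, and `generalized_geodesic` are hypothetical, and the brute-force matching is only viable for a handful of points.

```python
from itertools import permutations


def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def ot_map(base, target):
    """Exact OT map between two uniform point clouds of equal size.

    Brute force over all matchings (toy sizes only). Returns a list whose
    i-th entry is the target point assigned to base[i].
    """
    n = len(base)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(sq_dist(base[i], target[p[i]]) for i in range(n)),
    )
    return [target[j] for j in best]


def generalized_geodesic(base, targets, weights):
    """Push `base` forward through the weighted average of its OT maps.

    `weights` must be non-negative and sum to 1 (a point on the simplex).
    """
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    maps = [ot_map(base, t) for t in targets]
    dim = len(base[0])
    return [
        tuple(sum(w * m[i][d] for w, m in zip(weights, maps)) for d in range(dim))
        for i in range(len(base))
    ]


# Assumed toy data: a base cloud and two "datasets" of two points each.
base = [(0.0, 0.0), (1.0, 0.0)]
d1 = [(0.0, 2.0), (1.0, 2.0)]
d2 = [(5.0, 0.0), (4.0, 0.0)]  # deliberately listed out of order

mix = generalized_geodesic(base, [d1, d2], [0.5, 0.5])
print(mix)  # [(2.0, 1.0), (3.0, 1.0)]
```

Setting the weights to `[1.0, 0.0]` recovers the first dataset (reordered to match the base points), so the weight vector acts as the interpolation parameter. For realistic sizes the brute-force search would be replaced by a linear assignment or general OT solver.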

Taxonomy

Topics: Model Reduction and Neural Networks