Clustering-Based Validation Splits for Model Selection under Domain Shift
Andrea Napoli, Paul White

TL;DR
This paper proposes a clustering-based method for creating training-validation splits that maximise the distribution mismatch between the two sets. The method combines kernel k-means with linear-programming constraints and is intended to improve model selection under domain shift.
Contribution
The paper proposes a data-splitting technique based on the maximum mean discrepancy (MMD) and kernel k-means. A constrained clustering algorithm controls the size, label, and (optionally) group distributions of the splits, and requires no additional metadata.
Findings
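The technique consistently outperforms alternative splitting strategies across a range of datasets and training algorithms.
The MMD between the training and validation splits correlates with test-domain accuracy.
The algorithm comes with convergence guarantees and controls the size, label, and (optionally) group distributions of the splits.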
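As a rough illustration of the quantity behind the second finding, the sketch below computes a biased estimate of the squared MMD between two samples with a Gaussian (RBF) kernel. The kernel choice, the `gamma` bandwidth, and the function names `rbf_kernel` and `mmd2_biased` are assumptions made for illustration; this page does not state which kernel or estimator the authors use.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix between rows of a and rows of b (assumed kernel)."""
    # Pairwise squared Euclidean distances via the expansion |a|^2 + |b|^2 - 2 a.b
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point round-off
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2_biased(x, y, gamma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())
```

Under this sketch, a larger value indicates a greater mismatch between the two sets. For example, `mmd2_biased(train_features, val_features)` would score a candidate split, where both arguments are hypothetical feature arrays of shape `(n_samples, n_features)`.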
Abstract
This paper considers the problem of model selection under domain shift. Motivated by principles from distributionally robust optimisation and domain adaptation theory, it is proposed that the training-validation split should maximise the distribution mismatch between the two sets. By adopting the maximum mean discrepancy (MMD) as the measure of mismatch, it is shown that the partitioning problem reduces to kernel k-means clustering. A constrained clustering algorithm, which leverages linear programming to control the size, label, and (optionally) group distributions of the splits, is presented. The algorithm does not require additional metadata, and comes with convergence guarantees. In experiments, the technique consistently outperforms alternative splitting strategies across a range of datasets and training algorithms, for both domain generalisation and unsupervised domain adaptation…
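To make the link between maximising MMD and kernel k-means more concrete, here is a minimal sketch of unconstrained two-cluster kernel k-means on a precomputed kernel matrix. It is not the authors' algorithm: the linear-programming constraints on size, label, and group distributions are omitted. The RBF kernel, the deterministic initialisation, and the names `rbf_gram` and `kernel_kmeans_split` are assumptions made for illustration.

```python
import numpy as np

def rbf_gram(x, gamma=1.0):
    """Gaussian kernel (Gram) matrix for the rows of x (assumed kernel)."""
    sq_norms = (x ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def kernel_kmeans_split(K, n_iter=100):
    """Unconstrained 2-cluster kernel k-means on Gram matrix K; returns 0/1 labels."""
    n = K.shape[0]
    diag = np.diag(K)
    # Illustrative initialisation: seed with point 0 and the point least
    # similar to it, then assign each point to the more similar seed.
    a = 0
    b = int(np.argmin(K[a]))
    labels = (K[:, b] > K[:, a]).astype(int)
    for _ in range(n_iter):
        dist = np.full((n, 2), np.inf)
        for c in range(2):
            members = labels == c
            if not members.any():
                continue  # an empty cluster keeps infinite distance
            # ||phi(x_i) - mu_c||^2 = K_ii - 2 mean_j K_ij + mean_jl K_jl
            dist[:, c] = (diag
                          - 2.0 * K[:, members].mean(axis=1)
                          + K[np.ix_(members, members)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the iteration has converged
        labels = new_labels
    return labels
```

With `K = rbf_gram(features)` for a hypothetical feature array, `kernel_kmeans_split(K)` returns one label per sample; one cluster would then serve as the training set and the other as the validation set. Because nothing here constrains cluster sizes or label balance, the resulting split can be heavily skewed, which is the gap the paper's constrained formulation is described as addressing.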
Taxonomy
Topics: Power Systems and Technologies · Infrastructure Maintenance and Monitoring · Model Reduction and Neural Networks
