Variance-Reducing Couplings for Random Features

Isaac Reid; Stratis Markou; Krzysztof Choromanski; Richard E. Turner; Adrian Weller

arXiv:2405.16541·stat.ML·October 4, 2024

TL;DR

This paper introduces variance reduction techniques for random features using optimal transport couplings, improving efficiency and accuracy in kernel approximations across various machine learning models, including transformers and Gaussian processes.

Contribution

It proposes a unifying optimal transport framework for variance reduction in random features, with theoretical guarantees and practical benefits demonstrated on multiple applications.
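The link between variance reduction and optimal transport can be sketched with a generic Monte Carlo estimator. The notation below ($f$, $\eta$, $\mu$, $c$) is assumed for illustration and is not taken from the paper. Suppose a kernel value is estimated by averaging a function $f$ over $m$ random frequencies, each with the same fixed marginal distribution $\eta$:

$$
\hat{k} = \frac{1}{m}\sum_{i=1}^{m} f(\omega_i), \qquad \omega_i \sim \eta,
$$

$$
\mathrm{Var}(\hat{k}) = \frac{1}{m}\,\mathrm{Var}_{\omega\sim\eta}\big(f(\omega)\big) + \frac{1}{m^2}\sum_{i \neq j} \mathrm{Cov}\big(f(\omega_i), f(\omega_j)\big).
$$

Because the marginals are fixed, the estimator stays unbiased and the first term cannot change; only the cross terms depend on the joint distribution $\mu$ of $(\omega_1, \dots, \omega_m)$. Minimizing the variance is therefore equivalent to

$$
\min_{\mu \,:\, \text{all marginals equal } \eta} \; \mathbb{E}_{\mu}\big[c(\omega_1, \dots, \omega_m)\big], \qquad c(\omega_1, \dots, \omega_m) = \sum_{i \neq j} f(\omega_i)\, f(\omega_j),
$$

which has the form of a (multi-marginal) optimal transport problem with cost $c$. For $m = 2$ it reduces to the standard two-marginal Kantorovich problem.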

Findings

01

Couplings improve convergence of random feature estimates.

02

Variance reduction benefits vary depending on the task and coupling properties.

03

Optimal properties for attention estimation differ from those for general variance reduction.

Abstract

Random features (RFs) are a popular technique to scale up kernel methods in machine learning, replacing exact kernel evaluations with stochastic Monte Carlo estimates. They underpin models as diverse as efficient transformers (by approximating attention) and sparse spectrum Gaussian processes (by approximating the covariance function). Efficiency can be further improved by speeding up the convergence of these estimates: a variance reduction problem. We tackle this through the unifying lens of optimal transport, finding couplings to improve RFs defined on both Euclidean and discrete input spaces. They enjoy theoretical guarantees and sometimes provide strong downstream gains, including for scalable approximate inference on graphs. We reach surprising conclusions about the benefits and limitations of variance reduction as a paradigm, showing that other properties of the coupling should be…
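As a rough illustration of what "coupling" random features means, the sketch below estimates a Gaussian kernel with random Fourier features using either independent frequencies or frequencies with orthogonal directions. The orthogonal coupling is the one from the 'Orthogonal Random Features' paper cited by Reviewer 02, used here as a stand-in; it is not the optimal-transport coupling derived in this paper. The unit lengthscale, the function names, and the test points are all assumptions made for the example.

```python
import numpy as np


def iid_frequencies(rng, m, d):
    """m independent frequencies, each distributed as N(0, I_d)."""
    return rng.standard_normal((m, d))


def orthogonal_frequencies(rng, m, d):
    """m <= d frequencies with orthogonal directions; each is still marginally N(0, I_d)."""
    if m > d:
        raise ValueError("this sketch only handles a single orthogonal block (m <= d)")
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    q = q * np.sign(np.diag(r))  # sign fix so the orthogonal matrix is Haar-distributed
    # Norms drawn as norms of independent Gaussian vectors (chi-distributed with d dof).
    norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1)
    return norms[:, None] * q.T[:m]  # rows are the first m orthonormal columns of q


def rff_kernel_estimate(freqs, x, y):
    """Monte Carlo estimate of exp(-||x - y||^2 / 2) using cosine features."""
    return np.mean(np.cos(freqs @ (x - y)))


def empirical_mse(sampler, x, y, m, trials, seed):
    """Mean squared error of the estimator over repeated draws of the frequencies."""
    rng = np.random.default_rng(seed)
    exact = np.exp(-0.5 * np.sum((x - y) ** 2))
    estimates = np.array(
        [rff_kernel_estimate(sampler(rng, m, x.size), x, y) for _ in range(trials)]
    )
    return np.mean((estimates - exact) ** 2)


if __name__ == "__main__":
    d = 8
    x, y = np.zeros(d), np.zeros(d)
    y[0] = 1.0  # hypothetical inputs with ||x - y|| = 1
    print("i.i.d. MSE:    ", empirical_mse(iid_frequencies, x, y, d, 2000, seed=0))
    print("orthogonal MSE:", empirical_mse(orthogonal_frequencies, x, y, d, 2000, seed=0))
```

Both samplers leave every individual frequency with the same Gaussian marginal, so the estimator remains unbiased in either case; only the joint distribution changes, which is the quantity a coupling is free to choose.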

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

I find the general idea of the paper fascinating, namely formulating the variance reduction task as optimal transport. This approach can also be applied to similar Monte Carlo methods in future studies. It also allows for the application of various techniques in OT for studying and improving kernel estimation methods.

Weaknesses

Although the paper is well-written and easy to understand, its presentation can be improved. Many of the technical details summarized in the main body cannot be easily understood without referring to the appendix. There are confusing notations and definitions, especially for the case of graph kernels. I will mention some of them in the next part. I also have more specific questions/concerns about the justification of this framework, which are explained in the next part.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The manuscript is well presented, structured, and clearly written. There is a high degree of novelty in the application of optimal transport in determining the feature couplings. The proposed method is tested on a good selection of real world datasets.

Weaknesses

I have some concerns regarding some of the experimental results presented in the manuscript: Regarding Table 1 - Running only d features for RFF strikes me as a rather low number to choose, especially for these modestly sized datasets. It would be helpful to clarify how the results are impacted when using a larger number of features. It would be great to see something analogous to Figure 1 of the Yu et al. 'Orthogonal Random Features' paper, which depicts 1 < d/D < 10. While I can appreciate t

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. **Innovative Use of Optimal Transport**: The paper employs OT as a novel framework to address the variance reduction problem in RFs, which is a creative approach that ties together theoretical insights with practical application.
2. **Comprehensive Coverage**: It addresses both Euclidean and discrete input spaces, providing a unified strategy applicable across different domains.
3. **Theoretical Guarantees and Empirical Validation**: This paper offers theoretical guarantees and empirical vali

Weaknesses

1. The definition of, and relationship between, variance reduction in kernel estimation and optimal transport should be given in more detail and introduced so that non-specialists can understand them directly.
2. The computation and theorem (Theorem 3.2) are only for $m=2$, and the authors apply the copula tool as numerical OT solvers for multi-marginal OT.
3. Despite variance reduction, the downstream impact on model performance can be inconsistent, especially for transformers.
4. More comparisons between this paper

Videos

Variance-Reducing Couplings for Random Features · slideslive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques