Variance-Reducing Couplings for Random Features
Isaac Reid, Stratis Markou, Krzysztof Choromanski, Richard E. Turner, Adrian Weller

TL;DR
This paper introduces variance reduction techniques for random features using optimal transport couplings, improving efficiency and accuracy in kernel approximations across various machine learning models, including transformers and Gaussian processes.
Contribution
It proposes a unifying optimal transport framework for variance reduction in random features, with theoretical guarantees and practical benefits demonstrated on multiple applications.
Findings
Couplings improve convergence of random feature estimates.
Variance reduction benefits vary depending on the task and coupling properties.
Optimal properties for attention estimation differ from those for general variance reduction.
Abstract
Random features (RFs) are a popular technique to scale up kernel methods in machine learning, replacing exact kernel evaluations with stochastic Monte Carlo estimates. They underpin models as diverse as efficient transformers (by approximating attention) to sparse spectrum Gaussian processes (by approximating the covariance function). Efficiency can be further improved by speeding up the convergence of these estimates: a variance reduction problem. We tackle this through the unifying lens of optimal transport, finding couplings to improve RFs defined on both Euclidean and discrete input spaces. They enjoy theoretical guarantees and sometimes provide strong downstream gains, including for scalable approximate inference on graphs. We reach surprising conclusions about the benefits and limitations of variance reduction as a paradigm, showing that other properties of the coupling should be…
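To make the variance reduction problem in the abstract concrete, here is a minimal sketch that estimates a Gaussian kernel with random Fourier features, first with i.i.d. frequencies and then with a coupled set. The coupling used is orthogonal random features, a standard construction from prior work, and not the optimal transport couplings this paper proposes. All function names are hypothetical, and the kernel is assumed to be k(x, y) = exp(-||x - y||^2 / 2).

```python
import numpy as np

def rff_features(X, W):
    """Map inputs X (n, d) to random Fourier features using frequencies W (m, d)."""
    proj = X @ W.T
    m = W.shape[0]
    # phi(x) . phi(y) = (1/m) * sum_i cos(w_i . (x - y)), an unbiased kernel estimate
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1) / np.sqrt(m)

def iid_frequencies(m, d, rng):
    """Baseline: m independent frequencies, each distributed as N(0, I_d)."""
    return rng.standard_normal((m, d))

def orthogonal_frequencies(m, d, rng):
    """Coupled frequencies: mutually orthogonal directions with chi-distributed norms.

    Each row is still marginally N(0, I_d); only the joint distribution changes.
    This sketch assumes m <= d.
    """
    assert m <= d
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q = Q * np.sign(np.diag(R))  # sign fix so Q is Haar-distributed
    norms = np.sqrt(rng.chisquare(d, size=m))
    return norms[:, None] * Q[:m]

def kernel_estimate_stats(sampler, x, y, m, trials, rng):
    """Empirical mean and variance of the RF kernel estimate over repeated draws."""
    X = np.stack([x, y])
    estimates = np.empty(trials)
    for t in range(trials):
        Phi = rff_features(X, sampler(m, x.shape[0], rng))
        estimates[t] = Phi[0] @ Phi[1]
    return estimates.mean(), estimates.var()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 16
    x = np.zeros(d)
    y = np.zeros(d)
    y[0] = 1.0  # ||x - y|| = 1
    print("exact kernel:", np.exp(-0.5 * np.sum((x - y) ** 2)))
    for name, sampler in [("i.i.d.", iid_frequencies), ("orthogonal", orthogonal_frequencies)]:
        mean, var = kernel_estimate_stats(sampler, x, y, d, 2000, rng)
        print(f"{name}: mean={mean:.4f}, variance={var:.5f}")
```

Both samplers leave the marginal distribution of each frequency unchanged, so both estimators stay unbiased; only the dependence between frequencies differs. That is the sense in which choosing a joint distribution with fixed marginals can be posed as a transport problem.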
Peer Reviews
Decision: ICLR 2025 Poster
I find the general idea of the paper fascinating, namely formulating the variance reduction task as optimal transport. This approach can also be applied to similar Monte Carlo methods in future studies. It also allows for the application of various techniques in OT for studying and improving kernel estimation methods.
Although the paper is well-written and easy to understand, its presentation can be improved. Many of the technical details summarized in the main body cannot be easily understood without referring to the appendix. Some notations and definitions are confusing, especially for graph kernels; I mention some of them in the next part. I also have more specific questions and concerns about the justification of this framework, which I explain in the next part.
The manuscript is well presented, structured, and clearly written. There is a high degree of novelty in the application of optimal transport in determining the feature couplings. The proposed method is tested on a good selection of real world datasets.
I have some concerns regarding some of the experimental results presented in the manuscript. Regarding Table 1: using only d features for RFF strikes me as a rather low number, especially for these modestly sized datasets. It would be helpful to clarify how the results are impacted when using a larger number of features. It would be great to see something analogous to Figure 1 of the Yu et al. 'Orthogonal Random Features' paper, which depicts 1 < d/D < 10. While I can appreciate t
1. **Innovative Use of Optimal Transport**: The paper employs OT as a novel framework to address the variance reduction problem in RFs, which is a creative approach that ties together theoretical insights with practical application. 2. **Comprehensive Coverage**: It addresses both Euclidean and discrete input spaces, providing a unified strategy applicable across different domains. 3. **Theoretical Guarantees and Empirical Validation**: This paper offers theoretical guarantees and empirical vali
1. The relationship between variance reduction in kernel estimation and optimal transport should be defined in more detail and introduced so that non-specialists can understand it directly. 2. The computation and theorem (Theorem 3.2) are only for $m=2$, and the authors apply the copula tool as numerical OT solvers for multi-marginal OT. 3. Despite variance reduction, the downstream impact on model performance can be inconsistent, especially for transformers. 4. More comparisons between this paper
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques
