Scalable Random Wavelet Features: Efficient Non-Stationary Kernel Approximation with Convergence Guarantees
Sawan Kumar, Souvik Chakraborty

TL;DR
This paper introduces Random Wavelet Features (RWF), a scalable method for approximating non-stationary kernels using wavelets, with theoretical guarantees and superior performance on complex datasets.
Contribution
The paper presents RWF, a novel framework that extends random features to non-stationary kernels using wavelets, with proven convergence and positive definiteness.
Findings
RWF outperforms stationary random features on synthetic and real datasets.
RWF provides a scalable and expressive alternative to complex non-stationary models.
Theoretical analysis guarantees unbiasedness and convergence of RWF.
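The positive-definiteness and unbiasedness claims can be illustrated with the standard integral-of-feature-products construction. The sketch below is an assumption about the general form, not a reproduction of the paper's equations: the dilated and translated wavelet $\psi_{s,t}$, the sampling density $p(s,t)$, and the estimator $\hat{k}_D$ with $D$ sampled pairs are notation chosen here for illustration.

```latex
% Assumed kernel form: a p(s,t)-weighted integral of wavelet products
\[
k(x, y) = \iint \psi_{s,t}(x)\,\psi_{s,t}(y)\,p(s,t)\,ds\,dt,
\qquad
\psi_{s,t}(x) = \frac{1}{\sqrt{s}}\,\psi\!\left(\frac{x - t}{s}\right).
\]

% Positive semi-definiteness: for any points x_1..x_n and real c_1..c_n
\[
\sum_{i,j=1}^{n} c_i c_j\,k(x_i, x_j)
= \iint \Big(\sum_{i=1}^{n} c_i\,\psi_{s,t}(x_i)\Big)^{2} p(s,t)\,ds\,dt \;\ge\; 0 .
\]

% Monte Carlo estimator with (s_m, t_m) drawn i.i.d. from p, and its unbiasedness
\[
\hat{k}_D(x, y) = \frac{1}{D}\sum_{m=1}^{D} \psi_{s_m,t_m}(x)\,\psi_{s_m,t_m}(y),
\qquad
\mathbb{E}\big[\hat{k}_D(x, y)\big] = k(x, y).
\]
```

Under this assumed form, the second line holds for any real-valued feature family, and the third follows from linearity of expectation.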
Abstract
Modeling non-stationary processes, where statistical properties vary across the input domain, is a critical challenge in machine learning; yet most scalable methods rely on a simplifying assumption of stationarity. This forces a difficult trade-off: use expressive but computationally demanding models like Deep Gaussian Processes, or scalable but limited methods like Random Fourier Features (RFF). We close this gap by introducing Random Wavelet Features (RWF), a framework that constructs scalable, non-stationary kernel approximations by sampling from wavelet families. By harnessing the inherent localization and multi-resolution structure of wavelets, RWF generates an explicit feature map that captures complex, input-dependent patterns. Our framework provides a principled way to generalize RFF to the non-stationary setting and comes with a comprehensive theoretical analysis, including…
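A minimal sketch of what an explicit wavelet feature map of this kind might look like, assuming 1-D inputs. The Mexican hat mother wavelet, the log-normal scale distribution, the uniform translation range, and the function names are all hypothetical choices made for illustration; they are not taken from the paper's Algorithm 1.

```python
import numpy as np

def mexican_hat(u):
    """Mexican hat (Ricker) mother wavelet, up to a normalizing constant."""
    return (1.0 - u**2) * np.exp(-0.5 * u**2)

def sample_wavelet_params(num_features, rng, t_range=(-3.0, 3.0),
                          log_s_mean=-0.5, log_s_std=0.5):
    """Draw scales s > 0 (log-normal) and translations t (uniform).

    Both distributions are assumptions for this sketch.
    """
    s = np.exp(rng.normal(log_s_mean, log_s_std, size=num_features))
    t = rng.uniform(t_range[0], t_range[1], size=num_features)
    return s, t

def wavelet_features(x, s, t):
    """Map 1-D inputs x (shape [N]) to random wavelet features (shape [N, D])."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    u = (x - t) / s  # broadcasts [N, 1] against [D] to give [N, D]
    # 1/sqrt(s) is the usual wavelet normalization; 1/sqrt(D) makes the
    # inner product of two feature vectors a Monte Carlo average.
    return mexican_hat(u) / np.sqrt(s) / np.sqrt(s.size)

rng = np.random.default_rng(0)
s, t = sample_wavelet_params(2000, rng)
x = np.linspace(-2.0, 2.0, 5)
phi = wavelet_features(x, s, t)

# Approximate kernel matrix: K[i, j] = phi(x_i) . phi(x_j)
K = phi @ phi.T
```

Because each feature is localized around its own translation `t` at its own scale `s`, the resulting `K[i, j]` can depend on where `x[i]` and `x[j]` sit in the domain and not only on their difference, which is the sense in which such a kernel can be non-stationary.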
Peer Reviews
Decision·ICLR 2026 Poster
The paper derives potentially non-stationary kernels using wavelet-based dictionaries together with the random features machinery. Experimental results show slightly better RMSE values for GP regression with slightly reduced training cost compared to baselines on most experiments.
Theoretical novelty of the work is extremely limited. The positive-definiteness result (Thm. 4.1) is a direct application of the classic “integral of feature products” construction; nothing wavelet‑specific is used. The unbiasedness and uniform‑convergence proofs follow the standard random‑features and Monte-Carlo estimator analysis tools. Also, the claim about non-stationarity is violated if the weighting probability distribution $p(s,t)$ in Equation 3.3 is taken to be of the form $p_s(s) p_t(t
1. It provides a rigorous theoretical analysis of Random Wavelet Features (RWF), establishing positive definiteness, unbiasedness, variance bounds, and uniform convergence with explicit sample complexity.
2. RWF achieves O(ND²) training complexity, maintaining the scalability of random feature methods while effectively encoding non-stationarity via wavelet localization.
3. Extensive empirical evaluations on synthetic, speech, and large-scale regression datasets demonstrate that RWF consistently o
1. While the proposed RWF framework is clearly presented and supported by solid theoretical analysis, I have concerns regarding its novelty. The core formulation of RWF (Eqs. 3.1–3.3) and the sampling procedure (Algorithm 1) appear conceptually similar to existing RWF methods, which also construct kernel approximations using randomly sampled wavelet bases. The authors should clarify what specific differences or innovations distinguish RWF from earlier approaches such as L. Sun et al., “Wavelet-b
* The paper is well written and organised in a way that makes the presentation easy to follow.
* The proposed idea is intuitive and the theoretical analysis follows a similar approach to the analysis of RFF methods.
* The experimental evaluations compare against a wide range of baselines revealing mostly strong performance improvements.
* Interesting ablation studies are included, comparing for example training time and memory usage, the latter of which is usually rare to find.
* My main concern is that the related work discussion fails to cover previous wavelet-based kernel methods and their approximations. A quick literature search reveals wavelet support vector machines (Zhang et al., 2004) and other methods also apparently using wavelet-based kernel decompositions (Guo et al., 2004; Yger, 2011), and potentially others that I'd have missed. Even if these methods are not directly solving the same modelling problem, it'd be important to contrast this paper's approach
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · Generative Adversarial Networks and Image Synthesis · Stochastic Gradient Optimization Techniques
