A Look at the Effect of Sample Design on Generalization through the Lens of Spectral Analysis

Bhavya Kailkhura; Jayaraman J. Thiagarajan; Qunwei Li; Peer-Timo Bremer

arXiv:1906.02732·cs.LG·June 11, 2019


TL;DR

This paper introduces a spectral analysis framework to understand how sampling patterns influence the generalization error of machine learning models, linking geometric properties to spectral forms and providing error bounds.

Contribution

It develops a novel spectral analysis approach in Euclidean space that connects sampling geometry with generalization performance, offering insights for designing optimal sampling strategies.

Findings

01. Spectral properties of sampling patterns affect generalization error.

02. Error bounds and convergence rates are derived for various sampling methods.

03. Insights are provided that are independent of specific learning architectures.



Equations

$$P(\mathbf{k})=\frac{1}{N}|S(\mathbf{k})|^{2}=\frac{1}{N}\sum_{j,\ell}e^{-2\pi i\,\mathbf{k}\cdot(\mathbf{x}_{\ell}-\mathbf{x}_{j})}$$

$$G(r)=1+\frac{1}{N}r^{1-\frac{d}{2}}H_{\frac{d}{2}-1}\left(\rho^{\frac{d}{2}-1}\left(P(\rho)-1\right)\right)$$

$$R_{P}(h)\triangleq\mathbb{E}_{P(x,y)}[l(h(x),y)]=\int l(h(x),y)\,dP(x,y)$$

$$R_{S}(h)\triangleq\frac{1}{N}\sum_{i=1}^{N}l(h(x_{i}),y_{i})$$

$$R_{S}(h)\triangleq\frac{1}{N}\int_{\mathbb{D}}S(x)\,l(h(x),y)\,dx$$

$$\text{gen}(h)\triangleq\mathbb{E}_{S}\left[(R_{P}(h)-R_{S}(h))^{2}\right]$$

$$R_{S}(h)\triangleq\frac{1}{N}\int_{\phi}\mathcal{F}_{S}(\omega)\,\mathcal{F}_{l}(\omega)^{\intercal}\,d\omega$$

$$\text{gen}(h)\triangleq\frac{1}{N}\int_{\Theta}\mathbb{E}(\mathcal{P}_{S}(\omega))\,\mathcal{P}_{l}(\omega)\,d\omega$$

$$\text{gen}(h)\triangleq\frac{\mu(\mathcal{S}^{d-1})}{N}\int_{0}^{\infty}\rho^{d-1}\,\mathbb{E}(\hat{\mathcal{P}_{S}}(\rho))\,\hat{\mathcal{P}_{l}}(\rho)\,d\rho$$

$$\text{gen}(h)\leq\frac{\mu(\mathcal{S}^{d-1})}{N}c_{l}\int_{0}^{\rho_{0}}\rho^{d-1}\,\mathbb{E}(\hat{\mathcal{P}_{S}}(\rho))\,d\rho$$

$$\text{gen}(h)\leq\frac{\mu(\mathcal{S}^{d-1})}{N}c_{l}\int_{0}^{\rho_{0}}\rho^{d-1}\,\mathbb{E}(\hat{\mathcal{P}_{S}}(\rho))\,d\rho+\frac{\mu(\mathcal{S}^{d-1})}{N}c_{l}^{\prime}\int_{\rho_{0}}^{\infty}\rho^{-2}\,\mathbb{E}(\hat{\mathcal{P}_{S}}(\rho))\,d\rho$$

$$P_{S}(\rho-\rho_{z})=\begin{cases}0&\text{if }\rho\leq\rho_{z},\\1&\text{if }\rho>\rho_{z}.\end{cases}$$

$$G(r)=1-\frac{1}{N}\left(\frac{\rho_{z}}{r}\right)^{\frac{d}{2}}J_{\frac{d}{2}}(2\pi\rho_{z}r)$$

$$\rho_{z}^{*}=\sqrt[d]{\frac{N\,\Gamma(1+\frac{d}{2})}{\pi^{d/2}}}$$

$$G_{S}(r-r_{min})=\begin{cases}0&\text{if }r\leq r_{min},\\1&\text{if }r>r_{min}.\end{cases}$$

$$P_{S}(\rho-r_{min})=1-N\left(\frac{2\pi r_{min}}{\rho}\right)^{d/2}J_{d/2}(\rho\,r_{min})$$

$$r_{min}^{*}=\sqrt[d]{\frac{\Gamma(1+\frac{d}{2})}{\pi^{d/2}N}}$$


Taxonomy

Topics: Machine Learning and Algorithms · Sparse and Compressive Sensing Techniques · Neural Networks and Applications

Full text

A Look at the Effect of Sample Design on Generalization through the Lens of Spectral Analysis

Bhavya Kailkhura

Lawrence Livermore National Laboratories

Livermore, CA 15213

[email protected]

Jayaraman J. Thiagarajan

Lawrence Livermore National Laboratories

Livermore, CA 15213

[email protected]

Qunwei Li

Lawrence Livermore National Laboratories

Livermore, CA 15213

[email protected]

Peer-Timo Bremer

Lawrence Livermore National Laboratories

Livermore, CA 15213

[email protected]

Abstract

This paper provides a general framework to study the effect of sampling properties of training data on the generalization error of the learned machine learning (ML) models. Specifically, we propose a new spectral analysis of the generalization error, expressed in terms of the power spectra of the sampling pattern and the function involved. The framework is built in the Euclidean space using Fourier analysis and establishes a connection between some high dimensional geometric objects and the optimal spectral form of different state-of-the-art sampling patterns. Subsequently, we estimate the expected error bounds and convergence rates of different state-of-the-art sampling patterns, as the number of samples and dimensions increase. We make several observations about generalization error which are valid irrespective of the approximation scheme (or learning architecture) and training (or optimization) algorithms. Our result also sheds light on ways to formulate design principles for constructing optimal sampling methods for particular problems.

1 Introduction

Analyzing the generalization error of a learning algorithm is essential for estimating how well the generated hypothesis will apply to unknown test data. Traditionally, generalization error is analyzed based on the complexity of the function class, such as the Vapnik-Chervonenkis (VC) dimension and the Rademacher complexity [2], or on properties of the learning algorithm, such as uniform stability [3], and upper bounds on the error are derived. Recently, the authors in [16] showed that the mutual information between the collection of empirical risks of the available hypotheses and the final output of the algorithm can be used to analyze the generalization error in learning problems. In a similar information-theoretic setup, the authors in [1] proposed to bound generalization error using the total-variation distance.

Here, we are interested in studying generalization from the viewpoint of the sampler generating the training data. Sample design has been a long-standing research area in statistics, and a plethora of sampling solutions exist in the literature with a wide range of assumptions and statistical guarantees; see [7, 13] for a detailed review of related methods. The properties of the sampling distribution directly control the expected convergence behavior of the generalization error as the sample size grows. Consequently, designing an optimal sampler for a learning algorithm requires quantifying how the properties of the sampling distribution affect the generalization error. Unfortunately, existing theoretical tools for analyzing generalization error are not applicable for our purpose, as they do not provide a direct connection to sample properties such as uniformity and randomness. In this context, this paper addresses two important challenges: (1) identifying expressive metrics to quantify sample properties, and (2) bounding generalization error in terms of those tractable sample properties.

Generically, a good sampling technique aims to cover the input space as uniformly as possible, in order to generate the so-called space-filling experiment designs [10]. Since it is challenging to qualitatively evaluate the space-filling property, simple scalar metrics such as discrepancy [5] or geometric distances (maximin or minimax distance of a sample design [17]) are utilized. However, recent studies have shown that these scalar metrics are not very descriptive and, when used as the design objective, often result in poor-quality samples [12]. Furthermore, existing sampling distributions are not designed to specifically improve the generalization error of learning algorithms. This is due to the lack of a principled framework for connecting sampling properties to generalization error. To address this challenge, we develop a novel spectral analysis framework to study generalization error, expressed in terms of the power spectra of sampling patterns as well as the function to be recovered.

Contributions and Findings: First, we propose to adopt spectral analysis for characterizing the space-filling property of sampling patterns. More specifically, we use tools from statistical mechanics to connect the spectral properties of a sampling pattern with its spatial properties. Next, we develop an analysis framework for studying the generalization error behavior of a learning algorithm through the lens of spectral properties of the sample design. Using this framework, for isotropic, homogeneous sampling patterns (for which a radially averaged power spectrum can be used), we derive best and worst-case generalization error bounds.

While the majority of the existing literature on generalization error characterization based on sampling [3] has focused on uniform random sampling, the proposed analysis framework allows us to study the behavior of a large class of sample designs. In particular, we consider the blue noise [11, 9] and the Poisson disk sampling (PDS) distributions and obtain sampler-specific bounds (see Figure 1 in the supplementary material for examples of the distributions used). We characterize the gain due to blue noise and PDS samples over a random sampler in closed form. This analysis further helps us to formulate design principles to construct optimal sampling methods for specific ML problems. Finally, we make interesting (counter-intuitive) observations on the convergence behavior of generalization error with increasing dimensions, and hence develop novel spectral metrics to obtain meaningful convergence results for different sampling patterns (included in the supplementary material).

2 Preliminaries - Spectral Analysis for Sampling

Fourier analysis is a classical approach for studying properties of sampling distributions. For example, the power spectral density (PSD) can be used to assess the quality of sampling distributions. Alternatively, analyzing spatial characteristics of samples can also provide crucial insights. While such a spatial analysis has traditionally been carried out using heuristic measures for uniformity of sampling patterns, we adopt a more descriptive characterization.

Power Spectral Density: For a finite set of $N$ samples, $\{\mathbf{x}_{j}\}_{j=1}^{N}$ , in a region with unit volume, the radially-averaged power spectral density describes how the signal power is distributed over frequencies. It is formally defined as

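$$P(\mathbf{k})=\frac{1}{N}|S(\mathbf{k})|^{2}=\frac{1}{N}\sum_{j,\ell}e^{-2\pi i\,\mathbf{k}\cdot(\mathbf{x}_{\ell}-\mathbf{x}_{j})},$$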

where $S(\mathbf{k})$ denotes the spectral coefficients. For isotropic distributions, we have $P(k)=P(|\mathbf{k}|)$ .
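As a minimal sketch of the definition above (not the authors' code), the snippet below evaluates $P(\mathbf{k})$ at a single frequency vector directly from the sample positions, using $S(\mathbf{k})=\sum_{j}e^{-2\pi i\,\mathbf{k}\cdot\mathbf{x}_{j}}$. The function names `power_spectrum` and `regular_grid`, and the use of a regular lattice as a sanity check, are illustrative assumptions.

```python
import cmath
import math
import random


def power_spectrum(points, k):
    """P(k) = |S(k)|^2 / N with S(k) = sum_j exp(-2*pi*i * k.x_j)."""
    n = len(points)
    s = sum(
        cmath.exp(-2j * math.pi * sum(ki * xi for ki, xi in zip(k, x)))
        for x in points
    )
    return abs(s) ** 2 / n


def regular_grid(m):
    """m*m lattice points in the unit square (hypothetical test pattern)."""
    return [(a / m, b / m) for a in range(m) for b in range(m)]


if __name__ == "__main__":
    rng = random.Random(0)
    n = 256
    uniform = [(rng.random(), rng.random()) for _ in range(n)]
    # Average P(k) over a small block of non-zero integer frequencies.
    freqs = [(kx, ky) for kx in range(1, 6) for ky in range(1, 6)]
    mean_p = sum(power_spectrum(uniform, k) for k in freqs) / len(freqs)
    print(f"uniform random, mean P(k) over {len(freqs)} frequencies: {mean_p:.3f}")
    print("regular grid, P((1, 0)):", power_spectrum(regular_grid(16), (1, 0)))
```

For uncorrelated uniform samples the value at non-zero integer frequencies fluctuates around 1, while a regular lattice concentrates its power at multiples of the lattice frequency.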

Pair Correlation Function: A PCF describes the joint probability of having samples at two locations at the same time. It can be more precisely defined in terms of the intensity $\lambda$ and product density $\rho$ of a point process $X$ [14]. The intensity $\lambda(X)$ of $X$ is the average number of points in an infinitesimal volume around $X$ . For isotropic point processes, this is a constant. Let $\{B_{i}\}$ denote the set of infinitesimal spheres around the points, and $\{dV_{i}\}$ denote the volume measures of $B_{i}$ . The product density $P(\mathbf{x_{1}},\cdots,\mathbf{x_{N}})=\beta(\mathbf{x_{1}},\cdots,\mathbf{x_{N}})dV_{1}\cdots dV_{N}$ . In the isotropic case, for a pair of points, $\beta$ depends only on the distance between the points, hence one can write $\beta(\mathbf{x_{i}},\mathbf{x_{j}})=\beta(||\mathbf{x_{i}}-\mathbf{x_{j}}||)=\beta(r)$ and $P(r)=\beta(r)dxdy$ . The PCF is then defined as $G(r)=\dfrac{\beta}{\lambda^{2}}.$

Relating PCF and PSD via Fourier Transform: The PSD and PCF of a point distribution are related via the Fourier transform as follows:

[TABLE]

where $F(.)$ denotes the $d$ -dimensional Fourier transform. Next, we establish a fundamental relationship between PSD and PCF for radially symmetric or isotropic distributions.

Theorem 1.

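The pair correlation function and the power spectral density of a radially symmetric function are related as follows:

$$G(r)=1+\frac{1}{N}r^{1-\frac{d}{2}}H_{\frac{d}{2}-1}\left(\rho^{\frac{d}{2}-1}\left(P(\rho)-1\right)\right),$$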

where $H_{v}(f(x))=2\pi\int_{0}^{\infty}xf(x)J_{v}(2\pi rx)dx$ is the Hankel transform.

Proof.

Please see supplementary material. ∎

Realizability: The two necessary mathematical conditions that a sampling pattern must satisfy to be realizable are: (a) its PSD must be non-negative, i.e., $P(\rho)\geq 0,\;\forall\rho$ , and (b) its pair correlation function must be non-negative, i.e., $G(r)\geq 0,\;\forall r$ . (Whether these two conditions are also sufficient is still an open question; however, no counterexamples are known.)

3 Risk Minimization using Monte Carlo Estimates

We consider the following general setup, which encompasses several supervised learning formulations. We consider two spaces of objects $X\in\mathbb{T}^{d}$ (toroidal unit cube $[0,1]^{d}$ ) and $Y\in\mathbb{R}$ , and the goal is to learn a function $h:X\rightarrow Y$ (often called hypothesis) which outputs $y\in Y$ for a given $x\in X$ . We assume access to training data comprised of $N$ samples $S=\{(x_{1},y_{1}),\cdots,(x_{N},y_{N})\}$ drawn i.i.d. from an unknown distribution $P(x,y)$ . Supervised learning attempts to infer a hypothesis $h(.)$ that minimizes the population risk:

$$R_{P}(h)\triangleq\mathbb{E}_{P(x,y)}[l(h(x),y)]=\int l(h(x),y)\,dP(x,y),$$

where $l(.,.)$ denotes the loss function.

Empirical Risk Minimization: In general, the joint distribution $P(x,y)$ is unknown to the learning algorithm and hence the risk $R_{P}(h)$ cannot be computed. Instead, we often use an approximation, referred to as the empirical risk, obtained by averaging the loss function on the training data:

$$R_{S}(h)\triangleq\frac{1}{N}\sum_{i=1}^{N}l(h(x_{i}),y_{i}).$$

Note that the empirical risk ${R}_{S}(h)$ is a Monte Carlo (MC) estimate of the population risk $R_{P}(h)$ . It also has a continuous form

$$R_{S}(h)\triangleq\frac{1}{N}\int_{\mathbb{D}}S(x)\,l(h(x),y)\,dx,$$

where $\mathbb{D}$ is the sampling domain and $S(x)$ is the sampling function, i.e., a sampling pattern rewritten as a random signal $S$ composed of $N$ Dirac functions located at the sample positions, $S(x)=\sum_{i=1}^{N}\delta(x-x_{i})$ .

Generalization Error: In ML and statistical learning theory, the performance of a supervised learning algorithm is measured by the generalization error, which measures how accurately an algorithm is able to predict outcome values for previously unseen data. More specifically, we adopt the following definition of generalization error:

$$\text{gen}(h)\triangleq\mathbb{E}_{S}\left[(R_{P}(h)-R_{S}(h))^{2}\right],$$

which is the expected squared difference between the population risk of the output hypothesis and its empirical risk on the training data. The generalization error also has an alternative form with a direct link to the statistical properties of the sampling pattern:

[TABLE]

We consider sampling patterns which are homogeneous, i.e., the statistical properties of a sample are invariant to translation over the domain $\mathbb{D}$ . Homogeneous sampling patterns are unbiased in nature; thus, the generalization error arises only from the variance. Note that the variance analysis of Monte Carlo integration has been considered in the literature [6, 18, 15] and we build upon these methods. However, a similar analysis of generalization error in an ML context has not been carried out yet.
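To make the variance interpretation concrete, the following sketch estimates $\mathbb{E}_{S}[(R_{P}(h)-R_{S}(h))^{2}]$ by repeated i.i.d. uniform sampling in one dimension. It is an illustration only: the loss $l(x)=x^{2}$ (standing in for $l(h(x),y)$ with $h$ and the target fixed), the constant `POPULATION_RISK`, and all function names are assumptions chosen so that the population risk is known exactly.

```python
import random


def loss(x):
    # Hypothetical per-sample loss, collapsed into a function of x alone.
    return x * x


POPULATION_RISK = 1.0 / 3.0  # integral of x^2 over [0, 1]


def empirical_risk(samples):
    """R_S(h): average loss over the training samples."""
    return sum(loss(x) for x in samples) / len(samples)


def generalization_error(n, trials, rng):
    """Monte Carlo estimate of E_S[(R_P - R_S)^2] for i.i.d. uniform samples."""
    total = 0.0
    for _ in range(trials):
        samples = [rng.random() for _ in range(n)]
        total += (POPULATION_RISK - empirical_risk(samples)) ** 2
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    # For uniform x, Var(x^2) = 1/5 - 1/9 = 4/45, so N * gen should be near 4/45.
    for n in (8, 32, 128):
        gen = generalization_error(n, 4000, rng)
        print(f"N={n:4d}  gen={gen:.6f}  N*gen={n * gen:.4f}")
```

Because the estimator is unbiased here, the product $N\cdot\text{gen}(h)$ stays roughly constant, which is the $O(1/N)$ behavior of uncorrelated sampling.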

4 Connecting Generalization Error with Spectral Properties of Samples

In this section, we will express generalization error (or variance) in terms of the power spectra of both $S$ and $l$ . To this end, we use the Monte Carlo estimator of risk in the Fourier domain, and derive the variance in the Fourier domain by leveraging our homogeneity assumption on sampling patterns.

4.1 Monte Carlo Estimator of Risk in the Spectral domain

The MC estimator of the risk as given in equation (4) can be characterized in the Fourier domain $\phi$ using the fact that the dot product of functions (the integral of the product) is equivalent to the dot product of their Fourier coefficients. This allows us to build the MC estimator of the risk as follows:

$$R_{S}(h)\triangleq\frac{1}{N}\int_{\phi}\mathcal{F}_{S}(\omega)\,\mathcal{F}_{l}(\omega)^{\intercal}\,d\omega,$$

where $\mathcal{F}_{S}$ , $\mathcal{F}_{l}$ denote the Fourier transforms of the sampling function $S$ and the loss function $l$ .

4.2 Generalization Error via the Spectral Analysis

We now use the spectral domain version of empirical risk to define generalization error:

[TABLE]

where $\mathcal{F}_{S,l}(\omega,\omega^{\prime})\triangleq\mathcal{F}_{S}(\omega)\cdot\mathcal{F}_{l}(\omega)^{\intercal}\cdot\mathcal{F}_{S}(\omega^{\prime})^{\intercal}\cdot\mathcal{F}_{l}(\omega^{\prime})$ . Next, we provide an explicit closed-form relation of generalization error with the power spectra of both the sampling pattern and the loss function. To derive this relation, we first simplify (7) by restricting our analysis to homogeneous sampling patterns, which are unbiased.

Lemma 1.

The generalization error in terms of the power spectra of both the sampling pattern and the loss function in the toroidal domain can be obtained as:

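$$\text{gen}(h)\triangleq\frac{1}{N}\int_{\Theta}\mathbb{E}(\mathcal{P}_{S}(\omega))\,\mathcal{P}_{l}(\omega)\,d\omega.$$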

Proof.

Please see supplementary material. ∎

If homogeneous sampling is isotropic (i.e., the power spectrum is radially symmetric), then the error can be computed from the radial mean power spectrum of the loss $\hat{\mathcal{P}_{l}}$ and the sampling pattern $\hat{\mathcal{P}_{S}}$ .

Theorem 2.

The generalization error for isotropic homogeneous sampling patterns (in polar coordinates) is given by

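$$\text{gen}(h)\triangleq\frac{\mu(\mathcal{S}^{d-1})}{N}\int_{0}^{\infty}\rho^{d-1}\,\mathbb{E}(\hat{\mathcal{P}_{S}}(\rho))\,\hat{\mathcal{P}_{l}}(\rho)\,d\rho,$$

where $\mu(\mathcal{S}^{d-1})$ is the Lebesgue measure (i.e., the surface area) of the $(d-1)$ -dimensional unit sphere in $\mathbb{R}^{d}$ , given by $2\sqrt{\pi^{d}}/\Gamma(d/2)$ .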
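The prefactor $\mu(\mathcal{S}^{d-1})$ is easy to evaluate numerically. The short sketch below (the helper name `unit_sphere_area` is an assumption) computes $2\pi^{d/2}/\Gamma(d/2)$ and can be checked against the familiar values $2\pi$ for the unit circle and $4\pi$ for the unit sphere in three dimensions.

```python
import math


def unit_sphere_area(d):
    """Surface measure of the unit sphere S^{d-1} in R^d: 2*pi^(d/2)/Gamma(d/2)."""
    return 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)


if __name__ == "__main__":
    for d in (1, 2, 3, 5, 10, 20):
        print(f"d={d:2d}  mu(S^(d-1)) = {unit_sphere_area(d):.6f}")
```

The measure first grows and then decays with $d$, so the prefactor alone already makes comparisons across dimensions delicate.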

5 Best and Worst Case Generalization Error

Using the proposed spectral analysis framework to predict generalization error requires explicit knowledge of the power spectrum of the loss function, which is usually unknown. Thus, we restrict our analysis to a particular class of integrable functions of the form $l(x)_{\mathcal{X}_{\Omega}}$ with $l(x)$ smooth and $\Omega$ a bounded domain with a smooth boundary ( $\mathcal{X}_{\Omega}$ is the characteristic function of $\Omega$ ) [4]. We consider a best-case function and a worst-case function, both from this class of functions, to derive the error convergence rates as the number of samples $N$ and dimension $d$ grow. Note that the power spectra of sampling distributions are usually known in advance. We show that this information can be used in our framework to compute the generalization error bounds. Note that we perform our analysis following (9), where the error is characterized by the radial mean power spectra of both the sampling pattern and the loss function.

5.1 Best-Case Generalization Error

We define our best-case function directly in the spectral domain with the radial mean power spectrum profile $\hat{\mathcal{P}_{l}}(\rho)$ which is a constant $c_{l}$ for $(\rho<\rho_{0})$ , and zero elsewhere. The constant $c_{l}$ comes from the fact that the power spectrum is bounded. The best case error can be thus obtained from (9) as follows:

Lemma 2.

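The best-case generalization error for isotropic homogeneous sampling patterns (in polar coordinates) is given by

$$\text{gen}(h)\leq\frac{\mu(\mathcal{S}^{d-1})}{N}c_{l}\int_{0}^{\rho_{0}}\rho^{d-1}\,\mathbb{E}(\hat{\mathcal{P}_{S}}(\rho))\,d\rho.$$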

5.2 Worst-Case Generalization Error

For the worst case, we consider our function to exhibit a radial mean power spectrum $\hat{\mathcal{P}_{l}}(\rho)$ that is upper bounded by a constant $c_{l}$ for $(\rho<\rho_{0})$ , and by $c_{l}^{\prime}\rho^{-d-1}$ elsewhere, where $c_{l}$ and $c_{l}^{\prime}$ are non-zero positive constants. This spectral profile has a decay rate $O(\rho^{-d-1})$ for $\rho>\rho_{0}$ .

Lemma 3.

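The worst-case generalization error for isotropic homogeneous sampling patterns (in polar coordinates) is given by

$$\text{gen}(h)\leq\frac{\mu(\mathcal{S}^{d-1})}{N}c_{l}\int_{0}^{\rho_{0}}\rho^{d-1}\,\mathbb{E}(\hat{\mathcal{P}_{S}}(\rho))\,d\rho+\frac{\mu(\mathcal{S}^{d-1})}{N}c_{l}^{\prime}\int_{\rho_{0}}^{\infty}\rho^{-2}\,\mathbb{E}(\hat{\mathcal{P}_{S}}(\rho))\,d\rho.$$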

6 Sampler-Specific Generalization Error Bounds

In the previous section, we obtained the best and worst-case generalization error as a function of the sampling power spectrum $\mathbb{E}(\hat{\mathcal{P}_{S}}(\rho))$ . In this section, we study the effects of different sampling distributions on the generalization error.

Random (or Poisson) Sampler: This has a constant power spectrum since point samples are uncorrelated, i.e., $\mathbb{E}(\hat{\mathcal{P}_{S}}(\rho))=1,\forall\rho$ . For this spectral profile, the best-case generalization error can be obtained as:

[TABLE]

and the worst-case generalization error can be bounded as:

[TABLE]
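As a consistency check for the random sampler, one can substitute $\mathbb{E}(\hat{\mathcal{P}_{S}}(\rho))=1$ into the best-case integral $\int_{0}^{\rho_{0}}\rho^{d-1}\,d\rho$ and the worst-case tail integral $\int_{\rho_{0}}^{\infty}\rho^{-2}\,d\rho$. The following is a worked evaluation under that substitution, writing $\mu$ for $\mu(\mathcal{S}^{d-1})$; it is a sketch of the arithmetic and not a transcription of the paper's displayed equations:

$$\text{gen}_{b}(h)\leq\frac{\mu\,c_{l}}{N}\int_{0}^{\rho_{0}}\rho^{d-1}\,d\rho=\frac{\mu\,c_{l}\,\rho_{0}^{d}}{N\,d},$$

$$\text{gen}_{w}(h)\leq\frac{\mu\,c_{l}\,\rho_{0}^{d}}{N\,d}+\frac{\mu\,c_{l}^{\prime}}{N}\int_{\rho_{0}}^{\infty}\rho^{-2}\,d\rho=\frac{\mu\,c_{l}\,\rho_{0}^{d}}{N\,d}+\frac{\mu\,c_{l}^{\prime}}{N\,\rho_{0}}.$$

For fixed $d$ and $\rho_{0}$, both expressions scale as $1/N$.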

Blue Noise Sampler: Blue noise distributions are aimed at replacing visible aliasing artifacts with incoherent noise, and their properties are typically defined in the spectral domain. We consider the step blue noise pattern defined as follows: (a) the spectrum should be close to zero for low frequencies, which indicates the range of frequencies that can be recovered exactly; (b) the spectrum should be a constant one for high frequencies, i.e., represent uniform white noise, which reduces the risk of aliasing. The low frequency band with minimal energy is referred to as the zero region. Formally,

$$P_{S}(\rho-\rho_{z})=\begin{cases}0&\text{if }\rho\leq\rho_{z},\\1&\text{if }\rho>\rho_{z}.\end{cases}$$

The zero region $0\leq\rho\leq\rho_{z}$ indicates the range of frequencies that can be represented with no aliasing and the flat region $\rho>\rho_{z}$ guarantees that aliasing artifacts are mapped to broadband noise.

Lemma 4 ([11]).

The pair correlation function of a Step blue noise sample of size $N$ in $d$ dimensions, for a given zero region $\rho_{z}$ is given by

$$G(r)=1-\frac{1}{N}\left(\frac{\rho_{z}}{r}\right)^{\frac{d}{2}}J_{\frac{d}{2}}(2\pi\rho_{z}r).$$

Using Lemma 4, we can pose an optimization problem for determining the maximum achievable zero region $\rho_{z}$ , that does not violate realizability conditions, for a given sample budget $N$ .

Lemma 5.

The maximum achievable zero region using $N$ Step blue noise samples in $d$ dimensions is equal to the inverse of the $d$ -th root of the volume of a $d$ -dimensional hyper-sphere with radius $1/\sqrt[d]{N}$ ,

$$\rho_{z}^{*}=\sqrt[d]{\frac{N\,\Gamma(1+\frac{d}{2})}{\pi^{d/2}}},$$

where $\Gamma(.)$ is the gamma function. Equivalently, we can determine the minimum number of samples needed to construct a step blue noise pattern, $N=\frac{\pi^{d/2}\rho_{z}^{d}}{\Gamma(1+d/2)}$ .

Proof.

Please refer to the supplementary material. ∎
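The relation in Lemma 5 and its inverse, $N=\pi^{d/2}\rho_{z}^{d}/\Gamma(1+d/2)$, can be evaluated directly. The sketch below works in log space so that large $d$ does not overflow; the function names are assumptions introduced for illustration.

```python
import math


def max_zero_region(n, d):
    """rho_z^* = (N * Gamma(1 + d/2) / pi^(d/2))^(1/d), per Lemma 5."""
    log_val = math.log(n) + math.lgamma(1.0 + d / 2.0) - 0.5 * d * math.log(math.pi)
    return math.exp(log_val / d)


def min_samples_for_zero_region(rho_z, d):
    """Inverse relation: N = pi^(d/2) * rho_z^d / Gamma(1 + d/2)."""
    log_n = 0.5 * d * math.log(math.pi) + d * math.log(rho_z) - math.lgamma(1.0 + d / 2.0)
    return math.exp(log_n)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        rho_z = max_zero_region(n, 2)
        print(f"d=2  N={n:6d}  rho_z^*={rho_z:.4f}  "
              f"N recovered={min_samples_for_zero_region(rho_z, 2):.1f}")
```

In two dimensions the formula reduces to $\rho_{z}^{*}=\sqrt{N/\pi}$, so quadrupling the sample budget only doubles the zero region.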

For this spectral profile, the best-case generalization error can be obtained as:

[TABLE]

Note that, when $\rho_{0}\leq\rho_{z}^{*}$ the best-case generalization error $\text{gen}_{b}(h)=0$ , and when $\rho_{0}>\rho_{z}^{*}$ , we have

[TABLE]

The worst-case generalization error can be obtained as:

[TABLE]

Note that, when $\rho_{0}\leq\rho_{z}^{*}$ the worst-case generalization error $\text{gen}_{w}(h)=\dfrac{\mu c_{l}^{\prime}}{N\rho_{z}^{*}}$ , and when $\rho_{0}>\rho_{z}^{*}$ ,

[TABLE]
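The case analysis above can be reproduced by integrating the step spectrum against the best-case and worst-case profiles: $\int\rho^{d-1}\,d\rho$ over the part of $[0,\rho_{0}]$ lying above $\rho_{z}$, plus $\int\rho^{-2}\,d\rho$ over the tail. The sketch below encodes those closed forms; it is an illustrative evaluation under the step-spectrum assumption, the function names and default constants are hypothetical, and setting `rho_z = 0` recovers the flat spectrum of the random sampler.

```python
import math


def unit_sphere_area(d):
    """mu(S^{d-1}) = 2 * pi^(d/2) / Gamma(d/2)."""
    return 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)


def step_best_case(n, d, rho_0, rho_z, c_l=1.0):
    """Best-case integral for a step spectrum: zero below rho_z, one above."""
    if rho_0 <= rho_z:
        return 0.0
    return unit_sphere_area(d) * c_l * (rho_0 ** d - rho_z ** d) / (n * d)


def step_worst_case(n, d, rho_0, rho_z, c_l=1.0, c_l_prime=1.0):
    """Worst-case integrals for the same step spectrum."""
    low_band = step_best_case(n, d, rho_0, rho_z, c_l)
    # The rho^-2 tail nominally starts at rho_0, but the spectrum vanishes below rho_z.
    tail_start = max(rho_0, rho_z)
    return low_band + unit_sphere_area(d) * c_l_prime / (n * tail_start)


if __name__ == "__main__":
    n, d, rho_0 = 1000, 2, 5.0
    for rho_z in (0.0, 2.0, 8.0):
        print(f"rho_z={rho_z:4.1f}  best={step_best_case(n, d, rho_0, rho_z):.6f}  "
              f"worst={step_worst_case(n, d, rho_0, rho_z):.6f}")
```

When $\rho_{0}\leq\rho_{z}$ the worst-case value collapses to $\mu c_{l}^{\prime}/(N\rho_{z})$, matching the expression quoted in the text.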

Poisson Disk Sampler: Without any prior knowledge of the function $f$ of interest, a reasonable objective for sampling is that the samples should be random to provide an equal chance of finding features of interest. However, to avoid sampling only parts of the parameter space, a second objective is required to cover the space in $\mathcal{D}$ uniformly. Poisson Disk Sampling (PDS) patterns are designed to achieve these objectives. In particular, the step PCF sampling pattern is a set of samples that are distributed according to a uniform probability distribution (Objective 1: Randomness) but no two samples are closer than a given minimum distance $r_{min}$ (Objective 2: Coverage). Formally,

$$G_{S}(r-r_{min})=\begin{cases}0&\text{if }r\leq r_{min},\\1&\text{if }r>r_{min}.\end{cases}$$
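One simple way to generate a point set with this minimum-distance property is naive dart throwing: draw uniform candidates and reject any that fall within $r_{min}$ of an accepted point. The sketch below is an illustration of that generic technique on the unit torus, not the construction used in the paper; the function names, the attempt cap, and the parameter values are assumptions.

```python
import math
import random


def toroidal_distance(p, q):
    """Euclidean distance on the unit torus (coordinates wrap around at 1)."""
    total = 0.0
    for a, b in zip(p, q):
        delta = abs(a - b)
        delta = min(delta, 1.0 - delta)
        total += delta * delta
    return math.sqrt(total)


def dart_throwing(n, r_min, d=2, max_attempts=100000, seed=0):
    """Accept uniform candidates that keep every pairwise distance >= r_min."""
    rng = random.Random(seed)
    points = []
    attempts = 0
    while len(points) < n and attempts < max_attempts:
        attempts += 1
        candidate = tuple(rng.random() for _ in range(d))
        if all(toroidal_distance(candidate, p) >= r_min for p in points):
            points.append(candidate)
    return points  # may hold fewer than n points if r_min is too large


if __name__ == "__main__":
    pts = dart_throwing(n=200, r_min=0.04)
    closest = min(
        toroidal_distance(p, q) for i, p in enumerate(pts) for q in pts[:i]
    )
    print(f"accepted {len(pts)} points, closest pair distance {closest:.4f}")
```

Dart throwing costs $O(N^{2})$ distance checks and stalls as the domain fills up, so it is suitable only for small illustrations.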

The PDS can also be defined in the spectral domain as follows:

Lemma 6 ([12]).

The power spectrum of an ideal Poisson disk sampling pattern of size $N$ in $d$ dimensions, for a given $r_{min}$ , is given by

$$P_{S}(\rho-r_{min})=1-N\left(\frac{2\pi r_{min}}{\rho}\right)^{d/2}J_{d/2}(\rho\,r_{min}),$$

where $J_{d/2}(.)$ is the Bessel function of order $d/2$ .

Similar to the previous case, we can determine the maximum achievable $r_{min}$ that does not violate the realizability conditions for a given sample budget $N$ .

Lemma 7.

The maximum achievable $r_{min}$ using $N$ Step PCF samples in $d$ dimensions is equal to the inverse of the $d$ -th root of the volume of a $d$ -dimensional hyper-sphere with radius $\sqrt[d]{N}$ ,

$$r_{min}^{*}=\sqrt[d]{\frac{\Gamma(1+\frac{d}{2})}{\pi^{d/2}N}},$$

where $\Gamma(.)$ is the gamma function. Equivalently, we can also determine the minimum $N$ required to achieve a given $r_{min}$ , $N=\frac{\Gamma(1+d/2)}{\pi^{d/2}r_{min}^{d}}$ .
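As a worked example of this relation (illustrative numbers only), take $d=2$, where $\Gamma(2)=1$:

$$r_{min}^{*}=\sqrt{\frac{\Gamma(2)}{\pi N}}=\frac{1}{\sqrt{\pi N}},\qquad N=1000\;\Rightarrow\;r_{min}^{*}\approx 0.0178.$$

Conversely, requiring $r_{min}=0.05$ in two dimensions gives $N=\frac{1}{\pi\,(0.05)^{2}}\approx 127$ samples. For comparison, the step blue noise relation in two dimensions is $\rho_{z}^{*}=\sqrt{N/\pi}$, so the product $\rho_{z}^{*}\,r_{min}^{*}=1/\pi$ does not depend on $N$.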

For PDS sampling, the best-case generalization error can be obtained as:

[TABLE]

The worst-case generalization error can be obtained as:

[TABLE]

These integrals are complicated to compute and it is non-trivial to obtain clean and general bounds. Further simplifications under certain simplifying assumptions are provided in the supplementary material.

7 Convergence Analysis of Generalization Error

Next, we analyze the convergence of the error with blue noise and PDS sampling patterns as a function of the sample size $N$ . This analysis will shed light on design principles for constructing sampling patterns.

7.1 Analysis with Sample Size

For random sampling patterns, both the best and the worst case generalization errors converge as $O\left(\frac{1}{N}\right)$ . For blue noise sampling, if the best case functions/signals are bandwidth-limited with $\rho_{0}\leq\rho_{z}^{*}$ , then they can be perfectly recovered. However, when $\rho_{0}>\rho_{z}^{*}$ , the convergence is at the rate $O\left(\frac{1}{N}\right)$ , which is the same as random sampling. For worst case functions, the error converges as $O\left(\frac{1}{N\sqrt[d]{N}}\right)$ when $\rho_{0}\leq\rho_{z}^{*}$ and as $O\left(\frac{1}{N}\right)$ when $\rho_{0}>\rho_{z}^{*}$ . This provides a theoretical justification for designing a blue noise sampling pattern with a large zero region $\rho_{z}$ for better performance. Note that the convergence rate analysis of Poisson disk sampling is not straightforward due to the involvement of Bessel functions under the integrals in (26) and (27). Hence, we numerically analyze the convergence for the PDS pattern. As shown in Fig. 1, we observe that the best case convergence rate approximately behaves as $O\left(\frac{1}{N\sqrt[d]{N}^{b}}\right)$ with $b\geq 1$ and the worst case convergence behaves as $O\left(\frac{1}{N}\right)$ .
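The $O\left(\frac{1}{N\sqrt[d]{N}}\right)$ rate for blue noise can be checked from quantities already stated: the worst-case value $\mu c_{l}^{\prime}/(N\rho_{z}^{*})$ for $\rho_{0}\leq\rho_{z}^{*}$, together with $\rho_{z}^{*}\propto\sqrt[d]{N}$ from Lemma 5. The sketch below measures the log-log slope of that expression; function names and the default $c_{l}^{\prime}=1$ are assumptions for illustration.

```python
import math


def max_zero_region(n, d):
    """rho_z^* = (N * Gamma(1 + d/2) / pi^(d/2))^(1/d), per Lemma 5."""
    log_val = math.log(n) + math.lgamma(1.0 + d / 2.0) - 0.5 * d * math.log(math.pi)
    return math.exp(log_val / d)


def unit_sphere_area(d):
    """mu(S^{d-1}) = 2 * pi^(d/2) / Gamma(d/2)."""
    return 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)


def worst_case_in_zero_region(n, d, c_l_prime=1.0):
    """mu * c_l' / (N * rho_z^*): worst-case value stated for rho_0 <= rho_z^*."""
    return unit_sphere_area(d) * c_l_prime / (n * max_zero_region(n, d))


def loglog_slope(f, n_small, n_large):
    """Slope of log f(N) against log N between two sample sizes."""
    return (math.log(f(n_large)) - math.log(f(n_small))) / (
        math.log(n_large) - math.log(n_small)
    )


if __name__ == "__main__":
    for d in (2, 3, 5):
        slope = loglog_slope(lambda n: worst_case_in_zero_region(n, d), 1e3, 1e6)
        print(f"d={d}: fitted exponent {slope:.4f}, -(1 + 1/d) = {-(1 + 1 / d):.4f}")
```

The advantage over the $-1$ exponent of random sampling is $1/d$, which shrinks as the dimension grows.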

7.2 Some Guidelines for Sample Design

Results from the convergence analysis suggest that an ideal sampling power spectrum must attain zero values in the low frequency regime. Note that the realizability conditions severely limit the range of realizable power spectra and hence, in practice, this results in blue noise patterns with very small $\rho_{z}$ . Consequently, when the function is complex with $\rho_{0}>\rho_{z}^{*}$ , a blue noise sample design behaves similarly to a random design, $O(1/N)$ . On the other hand, Poisson disk samples have a better error convergence rate even for complex functions compared to blue noise patterns. However, when $\rho_{0}\leq\rho_{z}^{*}$ , the blue noise pattern is ideal. This suggests that an ideal sampling pattern should trade off the two paradigms by simultaneously carrying the blue noise and PDS properties.

In many practical scenarios, it is possible to use information acquired from previous observations to improve the sampling process. As more samples are obtained, one can learn how to improve the sampling process by deciding where to sample next. These sampling feedback techniques are more generally known as adaptive sampling in the statistics literature. Our analysis provides a novel way to quantify the value of a sample in terms of generalization error. A natural extension of our work is towards building importance sampling techniques guided by spectral properties.

8 Conclusions

In this paper, we developed a framework to study the interplay between sampling properties and the generalization error. We expressed generalization error in terms of the power spectra of the sampling pattern and the function of interest. We also analyzed the generalization error of some state-of-the-art sampling patterns and quantified their gain over a random sampler in closed form. Finally, we provided some design guidelines for constructing optimal sampling patterns for a given problem. There are still many interesting questions that remain to be explored in future work, such as an analysis of the generalization error for cases where data comes from non-linear manifolds. Note that some analytical methodologies used in this paper are certainly exploitable for studying the effect of sample design on generalization error in different manifolds. Other questions such as PSD/PCF parameterizations for other variants of space-filling designs, adaptive and importance sampling, and optimization approaches to synthesize them can also be investigated.

9 Acknowledgments

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

10 Appendix

11 Description of Sampling Distribution Families

In this paper, we consider three different families of sampling patterns for our generalization error analysis, namely random, blue noise and Poisson disk sampling. Figure 2 illustrates the point distributions along with their spectral/spatial properties for $d=2$ and $N=1000$ . Note that, we show the 2D PSD here, though our analysis assumes isotropic distributions and hence uses radially averaged 1D-PSD.

12 Proof for Lemma 1 from the main paper

The proof follows from [15] and is provided here for completeness.

Let us denote the Fourier domain without the DC peak frequency as $\Theta$ . Since homogeneous sampling patterns have statistical properties that are invariant to translation, it is equivalent to studying the error due to the translated version of each realization, with the average computed over all translations. Formally, we can treat the torus as the group of translations, so that $\tau(S)$ denotes the translation of $S$ by an element $\tau\in\mathcal{T}^{d}$ . Then, averaging equation (7) over all translations of $S$ , we get:

[TABLE]

where the exponential arises from the translation of the sampling pattern by a vector $\tau$ in the Fourier domain. When $\omega\neq\omega^{\prime}$ , the integral of the exponential part equals zero, so that only the case $\omega=\omega^{\prime}$ contributes to the variance. Hence, we can remove one integral over $\Theta$ and obtain
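The vanishing of the cross terms is the usual orthogonality of complex exponentials on the torus. As a sketch, assuming integer frequency vectors $\omega,\omega^{\prime}$ on the unit torus (the sign convention in the exponent does not affect the result):

$$\int_{\mathcal{T}^{d}}e^{-2\pi i\,(\omega-\omega^{\prime})\cdot\tau}\,d\tau=\prod_{m=1}^{d}\int_{0}^{1}e^{-2\pi i\,(\omega_{m}-\omega_{m}^{\prime})\,\tau_{m}}\,d\tau_{m}=\begin{cases}1&\text{if }\omega=\omega^{\prime},\\0&\text{otherwise,}\end{cases}$$

since each one-dimensional factor integrates a full number of periods to zero unless $\omega_{m}=\omega_{m}^{\prime}$.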

[TABLE]

Finally, denoting the power spectrum of the loss by $\mathcal{P}_{l}$ and the power spectrum of the sampling pattern normalized by $N$ as $\mathcal{P}_{S}$ , and leveraging the fact that $\|\mathcal{F}_{S,l}(\omega,\omega)\|^{2}=\|\mathcal{F}_{S}(\omega)\|^{2}\cdot\|\mathcal{F}_{l}(\omega)\|^{2}$ ,

[TABLE]

This provides the expression for the generalization error in terms of the power spectra of both the sampling pattern and the loss function in the toroidal domain.
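The step in the proof above that removes one integral over $\Theta$ relies on the orthogonality of complex exponentials under averaging over all translations. The sketch below checks this numerically, assuming integer frequency vectors on the unit torus and replacing the integral over $\tau$ by an average over a uniform grid (which is exact for frequency differences smaller than the grid size). The name `translation_average` is illustrative.

```python
import cmath
import math
from itertools import product

def translation_average(freq_diff, grid=16):
    """Average exp(2*pi*i (w - w').tau) over a uniform grid of translations tau."""
    d = len(freq_diff)
    total = 0j
    for idx in product(range(grid), repeat=d):
        # tau = idx / grid, so the phase is (w - w') . tau
        phase = sum(m * j / grid for m, j in zip(freq_diff, idx))
        total += cmath.exp(2j * math.pi * phase)
    return total / grid ** d

same = translation_average((0, 0))        # case w == w'
different = translation_average((3, -2))  # case w != w'
```

Here `same` evaluates to 1 and `different` vanishes up to floating-point error, so only the diagonal terms $\omega=\omega^{\prime}$ survive the averaging.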

13 Proof of Theorem 1 from the main paper

We know that the PSD and PCF of a point distribution are related via the Fourier transform as follows:

[TABLE]

where $F(.)$ denotes the $d$ -dimensional Fourier transform. Using symmetry of the Fourier transform, we have

[TABLE]

Next, we use polar coordinates with the $z$ axis along $\mathbf{k}$, so that $\mathbf{k}\cdot\mathbf{r}=\rho r\cos\theta$, where $\rho=|\mathbf{k}|$ and $r=|\mathbf{r}|$. For a radially symmetric PCF, we have $G(\mathbf{r})=G(r)$ and the above relationship can be rewritten as

[TABLE]

where $\omega$ is the area of the unit sphere in $(d-1)$ dimensions. Next, using the identity involving the Bessel function of order $v$, i.e.,

[TABLE]

we obtain

[TABLE]
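The exact form of the Bessel identity used in this proof is not reproduced here. A standard identity of this kind, stated as an assumption about what the proof relies on, is Poisson's integral representation of the Bessel function of the first kind, which evaluates the angular integral that appears once $\mathbf{k}\cdot\mathbf{r}=\rho r\cos\theta$ is substituted. The argument $z$ below stands for the product of $\rho$ and $r$, up to the constant fixed by the Fourier convention.

```latex
\begin{equation*}
\int_{0}^{\pi} e^{\,i z\cos\theta}\,\sin^{d-2}\theta\,\mathrm{d}\theta
  = \sqrt{\pi}\,\Gamma\!\left(\tfrac{d-1}{2}\right)
    \left(\frac{2}{z}\right)^{\frac{d}{2}-1} J_{\frac{d}{2}-1}(z),
  \qquad d \ge 2,\; z > 0 .
\end{equation*}
```

As a sanity check, for $d=3$ the left-hand side integrates directly to $2\sin z/z$, which matches the right-hand side because $J_{1/2}(z)=\sqrt{2/(\pi z)}\,\sin z$.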

14 Proof of Lemma 5 from the main paper

Note that, for a Step blue noise configuration to be realizable, it is sufficient to show that the corresponding PCF is non-negative. Thus, we have

[TABLE]

In the last inequality, we have used the following approximation

[TABLE]

15 Generalization Error Bounds for Poisson Disk Sampling Patterns

Best Case

[TABLE]

The second inequality above is based on the series form of the hypergeometric function and the assumption that $N$ is a large number.

16 Convergence Analysis of Generalization Error with Dimensions

In this section, we report some interesting observations when analyzing generalization error with increasing dimensions. We study the limiting behavior of $\rho_{z}^{*}$ and $r_{min}^{*}$ as $d$ approaches infinity. We show that the analysis with conventional metrics to characterize the zero region, i.e., the range of frequencies that can be represented with no aliasing, provides some rather counter-intuitive results.

Lemma 8.

As the dimension $d$ approaches infinity, the maximum achievable zero region for blue noise sampling, with a fixed $N$ , goes to infinity, i.e., $\lim_{d\rightarrow\infty}\rho_{z}^{*}=\infty$ and, the minimum number of samples needed to achieve a zero region $\rho_{z}$ approaches zero, i.e., $\lim_{d\rightarrow\infty}N=0.$

Intuitively, with growing $d$, one might expect $\rho_{z}^{*}\rightarrow 0$ and $N\rightarrow\infty$. To better understand this result, we study the relationship between these two quantities and the volume of a hypersphere. One of the surprising facts about a sphere of fixed radius in high dimensions is that, as the dimension increases, its volume goes to zero, which justifies the above results. Our intuitions about space are formed in two or three dimensions and often do not hold in high dimensions. A more surprising fact is that $\rho_{z}^{*}$ and $N$ are not monotonic functions of $d$ (see Figure 3). Either a steady increase or a steady decrease seems more plausible than having these two quantities grow for a while, reach a peak at some finite value of $d$, and thereafter decline. This behavior has also been observed in high-dimensional geometry when analyzing the volume of a hypersphere; however, no physical interpretation or intuition currently exists for this open research problem [8].
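The non-monotonic behavior can be reproduced numerically. The sketch below assumes the closed form $\rho_{z}^{*}=\sqrt[d]{N}\,\sqrt[d]{\Gamma(1+d/2)}/\sqrt{\pi}$, which is the expression used in the proof of Lemma 10 multiplied by $\sqrt{d}$, together with its inversion for $N$ (the volume of a $d$-ball of radius $\rho_{z}$). The function names and the choices $N=1000$ and $\rho_{z}=1$ are illustrative.

```python
import math

def rho_z_star(n_samples, d):
    """Radius of the d-ball with volume n_samples: N^(1/d) * Gamma(1 + d/2)^(1/d) / sqrt(pi)."""
    log_rho = (math.log(n_samples) + math.lgamma(1 + d / 2)) / d - 0.5 * math.log(math.pi)
    return math.exp(log_rho)

def min_samples(rho_z, d):
    """Volume of the d-ball of radius rho_z: pi^(d/2) * rho_z^d / Gamma(1 + d/2)."""
    log_n = 0.5 * d * math.log(math.pi) + d * math.log(rho_z) - math.lgamma(1 + d / 2)
    return math.exp(log_n)

dims = range(1, 201)
rho_curve = [rho_z_star(1000, d) for d in dims]  # fixed sample budget N = 1000
n_curve = [min_samples(1.0, d) for d in dims]    # fixed zero region rho_z = 1

# Dimension at which each curve turns around.
d_rho_turn = dims[rho_curve.index(min(rho_curve))]
d_n_turn = dims[n_curve.index(max(n_curve))]
```

Under these assumptions, `n_curve` rises to a maximum at $d=5$ (the dimension at which the unit ball has its largest volume) and then decays toward zero, while `rho_curve` has an interior turning point before growing without bound, so neither quantity is monotonic in $d$.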

Similarly, we study the asymptotic behavior of the maximum achievable $r_{min}$ for a fixed sample budget, and equivalently the minimum number of samples required to achieve a PDS with a given $r_{min}$ , as the dimension grows to infinity.

Lemma 9.

As the dimension $d$ approaches infinity, the maximum achievable $r_{min}$ for PDS sampling pattern, with a fixed number of samples, goes to infinity, i.e., $\lim_{d\rightarrow\infty}r_{min}^{*}=\infty$ and, the minimum number of samples needed to achieve a $r_{min}$ also approaches infinity, i.e., $\lim_{d\rightarrow\infty}N=\infty.$

The results in the lemma above are reasonable, since the space is growing exponentially fast.

16.1 Analysis with Proposed Metrics

Analysis with the metrics $\rho_{z}^{*}$ and $r_{min}^{*}$, which characterize the zero region through the magnitude of the frequency vector $\mathbf{k}$, leads to inconsistent results in high dimensions. We argue that comparing $\rho_{z}^{*}$ and $r_{min}^{*}$ across different dimensions is not accurate, and that these inconsistent results are a byproduct of the improper comparisons. Note that each $d$-dimensional space comprises a different range of frequency components, so comparing the magnitude of the frequency vector directly across dimensions is questionable. In particular, for a valid comparison of volumes across dimensions, we propose to measure them in terms of a standard volume in that dimension, i.e., the unit hypercube or measure polytope, which has a volume of $1$ in all dimensions. Further, as the dimension $d$ increases, the maximum possible distance between two points in a hypercube grows as $\sqrt{d}$. Consequently, to have the same scale across dimensions, we normalize the radius of the hypersphere by the factor $\sqrt{d}$. In summary, we introduce the relative zero region, i.e., $\hat{\rho}_{z}^{*}=\rho_{z}^{*}/\sqrt{d}$ (and likewise $\hat{r}_{min}^{*}=r_{min}^{*}/\sqrt{d}$), for a meaningful convergence analysis across dimensions.

Lemma 10.

As the dimension $d$ approaches infinity, the maximum achievable relative zero region $\hat{\rho}_{z}^{*}$ converges to a constant, i.e.,

$$\lim_{d\rightarrow\infty}\hat{\rho}_{z}^{*}=\frac{1}{\sqrt{2\pi e}}$$

and, the minimum number of blue noise samples needed to achieve $\hat{\rho}_{z}^{*}$ goes to infinity.

Proof.

To prove the first identity, note that $\dfrac{\rho_{z}^{*}}{\sqrt{d}}=\dfrac{\sqrt[d]{N}}{\sqrt{\pi d}}\sqrt[d]{\Gamma\left(1+\frac{d}{2}\right)}$ and invoke Stirling’s approximation, i.e., $\Gamma(1+m)\approx\left(\frac{m}{e}\right)^{m}\sqrt{2\pi m}$ for large $m$. The required result is then obtained by letting $d$ approach infinity. The second identity can be proved in a similar manner. ∎
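For reference, the limit in the first identity can be worked out explicitly from the two ingredients named in the proof. The following is a sketch of that calculation for a fixed sample budget $N$, with Stirling's approximation applied at $m=d/2$.

```latex
\begin{align*}
\hat{\rho}_{z}^{*} = \frac{\rho_{z}^{*}}{\sqrt{d}}
  &= \frac{N^{1/d}}{\sqrt{\pi d}}\,\Gamma\!\left(1+\tfrac{d}{2}\right)^{1/d} \\
  &\approx \frac{N^{1/d}}{\sqrt{\pi d}}
     \left(\frac{d}{2e}\right)^{1/2}(\pi d)^{1/(2d)} \\
  &= \frac{N^{1/d}\,(\pi d)^{1/(2d)}}{\sqrt{2\pi e}}
  \;\xrightarrow[d\to\infty]{}\; \frac{1}{\sqrt{2\pi e}} \approx 0.242 .
\end{align*}
```

Both $N^{1/d}$ and $(\pi d)^{1/(2d)}$ tend to $1$ as $d$ grows, so the limiting constant does not depend on $N$.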

Similarly, we study the asymptotic behavior of $\hat{r}_{min}^{*}$ for PDS sampling pattern.

Lemma 11.

As the dimension $d$ approaches infinity, the maximum achievable relative $\hat{r}_{min}^{*}$ converges to a constant, i.e.,

[TABLE]

and, the minimum number of PDS samples needed to achieve $\hat{r}_{min}^{*}$ goes to infinity.

The results in Lemmas 10 and 11 show interesting limiting behaviors of both blue noise and PDS sampling distributions.

Bibliography

[1] I. M. Alabdulmohsin. Algorithmic stability and uniform generalization. In Advances in Neural Information Processing Systems, pages 19–27, 2015.
[2] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[3] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
[4] L. Brandolini, L. Colzani, and A. Torlaschi. Mean square decay of Fourier transforms in Euclidean and non Euclidean spaces. Tohoku Mathematical Journal, Second Series, 53(3):467–478, 2001.
[5] R. E. Caflisch. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, 7:1–49, 1998. doi: 10.1017/S0962492900002804.
[6] F. Durand. A frequency analysis of Monte-Carlo and other numerical integration schemes. 2011.
[7] S. S. Garud, I. A. Karimi, and M. Kraft. Design of computer experiments: A review. Computers and Chemical Engineering, 106(Supplement C):71–95, 2017. ISSN 0098-1354. ESCAPE-26.
[8] B. Hayes. An adventure in the nth dimension. American Scientist, 99(6):442–446, 2011.
