Efficient Kirszbraun Extension with Applications to Regression

Hanan Zaichyk; Armin Biess; Aryeh Kontorovich; Yury Makarychev

arXiv:1905.11930·cs.LG·March 10, 2022

TL;DR

This paper presents a novel regression framework between Hilbert spaces using Kirszbraun's extension theorem, offering improved computational efficiency and empirical performance in supervised learning tasks.

Contribution

It introduces the first application of Kirszbraun's extension to supervised learning, with a new MWU algorithm that improves runtime and performance.

Findings

1. Quadratic runtime improvement over existing methods
2. Significant empirical performance gains
3. Effective decomposition into training and prediction stages

Abstract

We introduce a framework for performing regression between two Hilbert spaces. This is done based on Kirszbraun's extension theorem, to the best of our knowledge, the first application of this technique to supervised learning. We analyze the statistical and computational aspects of this method. We decompose this task into two stages: training (which corresponds operationally to smoothing/regularization) and prediction (which is achieved via Kirszbraun extension). Both are solved algorithmically via a novel multiplicative weight updates (MWU) scheme, which, for our problem formulation, achieves a quadratic runtime improvement over the state of the art. Our empirical results indicate a dramatic improvement over standard off-the-shelf solvers in our setting.

Tables (11)

Table 1: ERM of the smoothing process (average loss)

Training points	20	100	200	500	1000
MWU	247.9405	0.3333	0.31581	0.31854	0.36143
IntPt	4.1e-18	46023.7964	353691.64	-	-

Table 2: Cross-validation running time of the smoothing process (seconds)

Training points	20	100	200	500	1000
MWU	2.7505	19.9428	46.2479	212.4875	1243.2395
IntPt	18.1733	692.6523	4087.6655	-	-

Table 3: Single* smoothing process running time (seconds)

Training points	20	100	200	500	1000
MWU	0.092	0.7051	1.6326	8.0798	45.1497
IntPt	2.474	155.5521	766.5433	-	-

Table 4: Extension average loss

Training points	20	100	200	500	1000
MWU	1119.4705	0.3333	0.37267	0.43678	0.52797
IntPt	3065.5698	9475.260	9475.4864	-	-

Table 5: Extension running time for all test points (seconds)

Training points	20	100	200	500	1000
MWU	0.054	0.0014903	0.002826	0.0051038	0.0089853
IntPt	1.3367	1.6918	2.6156	-	-

Table 6: Visual comparison between our MWU-based algorithm (first row) and the IntPt-based (MATLAB) algorithm (second row), for $f=x^{3}$ and $N=100$ random points. In all graphs, the blue line represents the ground-truth function $f=x^{3}$, while the orange ’x’ symbols represent the estimates of the data points by the learned function. While the MWU-based algorithm fits both the training and test sets with high accuracy, the IntPt method has several “heavy” outliers, which significantly increase its average squared error.

Table 7: ERM of the smoothing process (average loss)

Training points	100	200	500	1000
MWU	5.5e-09	1.6639e-08	3.7683e-08	6.8339e-08
IntPt	5.3e-16	5913495.3623	-	-

Table 8: Cross-validation running time of the smoothing process (seconds)

Training points	100	200	500	1000
MWU	5.1933	11.3827	66.5677	335.0659
IntPt	3508.4048	3936.6494	-	-

Table 9: Single* smoothing process running time (seconds)

Training points	100	200	500	1000
MWU	0.78109	1.8081	8.1855	48.5127
IntPt	114.5577	778.1016	-	-

Table 10: Extension average loss

Training points	100	200	500	1000
MWU	5.5e-09	1.09e-08	3.3e-08	7.3e-08
IntPt	0.2877	0.83248	-	-

Table 11: Extension running time for the entire test set (seconds)

Training points	100	200	500	1000
MWU	0.0026	0.0038	0.0061	0.0095
IntPt	4.3164	1.824	-	-

Equations

$$\mathrm{Minimize}\quad x^{\top}P_{0}x+a^{\top}x\quad\mathrm{subject\ to}\quad x^{\top}P_{i}x+a_{i}^{\top}x\leq b_{i},\quad i\in[m].$$

$$(1/\varepsilon)^{O(a)}\,b\log n\log\log n.$$

$$1-h_{i}(y)=\frac{\|y-y_{i}\|}{\|x^{*}-x_{i}\|}\leq\frac{\|y-y^{\circ}\|+\|y^{\circ}-y_{i}\|}{\|x^{*}-x_{i}\|}\leq\frac{\|x^{\circ}-x^{*}\|+\|x^{\circ}-x_{i}\|}{\|x^{*}-x_{i}\|}\leq\frac{2\|x^{\circ}-x^{*}\|+\|x^{*}-x_{i}\|}{\|x^{*}-x_{i}\|}\leq 3.$$

$$\sum_{i=1}^{n}w_{i}h_{i}(y)\geq 0.$$

$$P=\sum_{i=1}^{n}\frac{w_{i}}{\|x^{*}-x_{i}\|^{2}}\quad\text{and}\quad p_{i}=\frac{w_{i}}{P\|x^{*}-x_{i}\|^{2}},$$

$$Q=\sum_{i=1}^{n}\frac{w_{i}}{\|x^{*}-x_{i}\|}\quad\text{and}\quad q_{i}=\frac{w_{i}}{Q\|x^{*}-x_{i}\|}.$$

$$z=\frac{\|x^{\circ}-x^{*}\|}{\|z_{0}-y^{\circ}\|}z_{0}+\left(1-\frac{\|x^{\circ}-x^{*}\|}{\|z_{0}-y^{\circ}\|}\right)y^{\circ}.$$

$$\frac{\|y^{*}-y_{i}\|^{2}}{\|x^{*}-x_{i}\|^{2}}\leq 1,$$

$$Q\sum_{i=1}^{n}q_{i}\|z-y_{i}\|\leq\left(\sum_{i=1}^{n}w_{i}\frac{\|z-y_{i}\|^{2}}{\|x^{*}-x_{i}\|^{2}}\sum_{i=1}^{n}w_{i}\right)^{1/2}\leq\left(\sum_{i=1}^{n}w_{i}\frac{\|y^{*}-y_{i}\|^{2}}{\|x^{*}-x_{i}\|^{2}}\right)^{1/2}\leq 1.$$

$$\|y_{i}-y^{*}\|\leq L\|x_{i}-x_{j}\|+(1+\varepsilon)L\|x_{j}-x^{*}\|\leq L\bigl((2+\varepsilon)\|x_{i}-x_{j}\|+(1+\varepsilon)\|x_{i}-x^{*}\|\bigr)\leq(1+3\varepsilon+\varepsilon^{2})L\|x_{i}-x^{*}\|,$$

$$\|y_{i}-y^{*}\|\leq L\|x_{i}-x^{\circ}\|+(1+\varepsilon)L\|x^{\circ}-x^{*}\|\leq L\bigl(\|x_{i}-x^{*}\|+(2+\varepsilon)\|x^{\circ}-x^{*}\|\bigr)\leq(1+\varepsilon)^{2}L\|x_{i}-x^{*}\|.$$

$$\|p-x_{j}\|\leq(1+\varepsilon/3)\|x_{i}-p\|\leq(1+\varepsilon/3)\varepsilon r/3$$

$$\|x_{i}-x_{j}\|\leq\varepsilon r/3+(1+\varepsilon/3)\varepsilon r/3\leq(2+\varepsilon/3)\varepsilon r/3\leq\varepsilon\|x^{*}-x_{i}\|,$$

$$O\left(ma+\frac{m^{3/2}(\log m)^{2}\,b\log(1/\varepsilon)}{\varepsilon^{5/2}}\right),$$

$$\mathrm{Minimize}\quad\Phi(\mathbf{Y},\widetilde{\mathbf{Y}})\quad\mathrm{subject\ to}\quad\|\tilde{y}_{i}-\tilde{y}_{j}\|\leq L\|x_{i}-x_{j}\|,\quad i,j\in[n].$$

$$\text{minimize}\quad\Psi(\mathbf{Y},\widetilde{\mathbf{Y}},\{\lambda_{i}\},\{\mu_{ij}\})\equiv\sum_{i=1}^{n}\lambda_{i}\|y_{i}-\tilde{y}_{i}\|^{2}+\sum_{(i,j)\in E}\mu_{ij}\|\tilde{y}_{i}-\tilde{y}_{j}\|^{2}.$$

$$(\mathcal{L}+\Lambda)\widetilde{\mathbf{Y}}^{\top}=\Lambda\mathbf{Y}^{\top}.$$

$$\Phi(\mathbf{Y},\widetilde{\mathbf{Y}}^{*})\leq\Phi_{0}\leq(1+\varepsilon)\Phi(\mathbf{Y},\widetilde{\mathbf{Y}}^{*})$$

$$h_{\Phi}(\widetilde{\mathbf{Y}})=1-\sqrt{\frac{\Phi(\mathbf{Y},\widetilde{\mathbf{Y}})}{\Phi_{0}}}\quad\text{and}\quad h_{ij}(\widetilde{\mathbf{Y}})=1-\frac{\|\tilde{y}_{i}-\tilde{y}_{j}\|}{L\|x_{i}-x_{j}\|}\quad\text{for }(i,j)\in E.$$

$$\Phi(\mathbf{Y},\widetilde{\mathbf{Y}})\leq(1+\varepsilon)^{2}\Phi_{0}$$

$$O\left(ma+m^{3/2}\,b\,(\log n)^{2}\log(1/\varepsilon)/\varepsilon^{5/2}\right),$$

$$\exists?\,x\in\mathcal{P}:\ \forall i\in[m]:\ f_{i}(x)\geq 0,$$

$$\exists?\,x\in\mathcal{P}:\ \sum_{i}p_{i}f_{i}(x)\geq 0.$$

Taxonomy

Topics: Numerical methods in engineering · Domain Adaptation and Few-Shot Learning · Sparse and Compressive Sensing Techniques

Full text

Efficient Kirszbraun Extension with Applications to Regression

Hanan Zaichyk

Ben-Gurion University

Armin Biess

Ben-Gurion University

Aryeh Kontorovich

Ben-Gurion University

Yury Makarychev

Toyota Technological Institute at Chicago

1 Introduction.

Regression.

The classical problem of estimating a continuous-valued function from noisy observations, known as regression, is of central importance in statistical theory with a broad range of applications; see, for example, Györfi et al. (2006); Nadaraya (1989). When the target function is assumed to have a specific structure, the regression problem is termed parametric and the optimization problem is finite-dimensional. Linear regression (Mohri et al., 2012, chapter 10.3.1) is perhaps the simplest and most common type of parametric regression. When no structural assumptions concerning the target function are made, the regression problem is nonparametric. Informally, the main objective in the study of nonparametric regression is to understand the relationship between the regularity conditions that a function class might satisfy (e.g., Lipschitz or Hölder continuity, or sparsity in some representation) and its behavior vis-à-vis optimization and generalization. Most existing algorithms for regression either focus on the scalar-valued case or else reduce multiple outputs to several scalar problems (Borchani et al., 2015), see Related Work.

Convex optimization.

Many learning problems can be cast in the framework of convex optimization. In particular, regression naturally lends itself to this formulation. While some cases, such as linear regression, admit efficient closed-form solutions, this is not the case in general. Typically, convex optimization problems are solved via iterative methods up to a specified accuracy. One general approach is the family of interior-point methods, which, on problems with $n$ variables and $m$ constraints, achieves a runtime of $O(\max\{n^{3},n^{2}m,F\})$, where $F$ is the cost of evaluating the first and second derivatives of the objective and the constraints (Boyd and Vandenberghe, 2004).

Motivation and contribution.

The chief motivation of this work was to generalize the results of Gottlieb et al. (2017), who provided efficient nonparametric regression methods in the scalar-output case. Attempts to numerically solve our optimization problem, which is naturally formulated as a Quadratically Constrained Quadratic Program (QCQP), via state-of-the-art off-the-shelf solvers indicated that these are incapable of handling our framework, even for relatively small data sets and dimensions. This limitation of QCQP solvers motivated us to develop a specialized algorithm to solve the optimization problem entailed by our regression setting.

In Section 4, we show that our specialized algorithm dramatically outperforms general-purpose QCQP solvers; this algorithm, its theoretical analysis, and MATLAB code (available at https://github.com/HananZaichyk/Kirszbraun-extension) are the main contributions of this paper. We introduce a framework for performing regression between two Hilbert spaces. This is done based on Kirszbraun’s extension theorem; to the best of our knowledge, this is the first application of this technique to supervised learning. This method directly exploits the metrics of the input and output spaces, which makes it explicitly sensitive to the interaction among the output components. Although our main contributions are algorithmic, a statistical analysis of our regression technique is provided in Appendix C.

We formulate the regression problem in two stages: smoothing and extension, which are formally described in Section 3. Roughly speaking, on a dataset of size $n$ with $a$ input and $b$ output dimensions, we formulate the smoothing problem as a Quadratically Constrained Quadratic Program (QCQP) problem with $bn$ variables and $O(n^{2})$ constraints. The extension problem is also formulated as a QCQP with $O(n)$ variables and $O((a+b)n)$ constraints.

Although general QCQP problems are not convex, our special instance is, and as such is, in principle, amenable to the standard convex optimization framework, such as interior point methods. When solving large-scale problems, even a modest improvement in the exponent yields dramatic runtime savings. We propose a Multiplicative Weight Update (MWU) scheme to solve the smoothing problem, to a constant precision, in runtime $O(ma+{m^{3/2}(\log m)^{2}b})$ and the extension problem in runtime $O(na+nb\log n)$ .

Related work.

Previous approaches to vector-valued regression include $\varepsilon$-insensitive SVM with $p$-norm regularization (Brudnak, 2006), least-squares and MLE-based methods (Jain and Tewari, 2015), and (for linear models) the Dantzig selector (Chen and Banerjee, 2018). According to a recent survey (Borchani et al., 2015), existing methods essentially “transform the multi-output problem into independent single-output problems.” Some approaches to multitask learning problems (Caruana, 1997) exploit relations between the different tasks. In econometrics, this decoupling of the outputs is made explicit in the Seemingly Unrelated Regressions (SUR) model (Davidson et al., 1993; Greene, 2003, 2012). These approaches, however, do not seem to address the need for a single vector output with possibly strong relations between its coordinates. We devise a principled approach for leveraging these dependencies via Kirszbraun extension. The latter has previously been applied by Mahabadi et al. (2018) to dimensionality reduction (unsupervised learning), but to the best of our knowledge has not been used in the supervised learning setting.

Both of our problems (smoothing and extension) may be formulated as QCQP programs, whose most general form is

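$$\mathrm{Minimize}\quad x^{\top}P_{0}x+a^{\top}x\quad\mathrm{subject\ to}\quad x^{\top}P_{i}x+a_{i}^{\top}x\leq b_{i},\quad i\in[m],$$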

where $a$, $a_{i\in[m]}$ and $x$ are vectors, $P_{0},P_{i\in[m]}$ are matrices, and the $b_{i}$ are scalars. The general problem is NP-hard, but when all of the $P_{i}$ are positive semi-definite, the problem is convex and can be solved in polynomial time (Boyd and Vandenberghe, 2004). The QCQP is usually solved in practice using log-barrier or primal-dual interior-point methods. The running time of an optimization algorithm based on interior-point methods depends significantly on the problem at hand. Specifically, consider a problem with $N$ variables and $m$ constraints. In order to obtain a $(1+\varepsilon)$-approximate solution, the algorithm has to perform $\Theta(\sqrt{m}\log(1/\varepsilon))$ iterations in the worst case (Nesterov and Nemirovskii, 1994, Chapter 6). In each iteration, the algorithm has to initialize and invert an $N\times N$ Hessian matrix (or equivalently solve a system of $N$ linear equations with $N$ variables). The time required to initialize the Hessian matrix is problem-specific: while it is $O(mN^{2})$ in the worst case, it is often significantly less than that. The Hessian matrix can be inverted in $O(N^{\omega})$ time, where $\omega$ is the matrix multiplication exponent (Bunch and Hopcroft, 1974) (the best current upper bound on $\omega$ is $2.37286$ (Alman and Williams, 2021)). However, to the best of our knowledge, all implementations used in practice perform this step in $O(N^{3})$ time. That said, this step can be significantly sped up if the Hessian matrix has a special structure.

Our Multiplicative Weight Update (MWU) scheme is based on the framework of Arora et al. (2012). We include the relevant background and results in the Appendix for completeness.

Main results.

We cast the general regression problem between Hilbert spaces as two QCQP programs, and provide an efficient algorithm for each problem.

The problem setup, formalized in Section 2, involves a dataset of size $n$ of vectors in an $a$ -dimensional Euclidean space labeled by $b$ -dimensional vectors. The smoothing (also: training, regularization, denoising) problem (Section 3.2) is to perturb the labels so as to achieve the user-specified Lipschitz smoothness constraint while incurring a minimum distortion. This is a standard statistical technique, known as regularization, which prevents overfitting in prediction. Our Theorem 3.5 solves the smoothing optimization problem, up to a tolerance $\varepsilon$ , in runtime $O(an^{2}+bn^{3}(\log n)^{2}\log(1/\varepsilon)/\varepsilon^{5/2})$ .

Next, we address the task of prediction (i.e., assigning a label to a test point). In Theorem 3.1, we accomplish this via $\varepsilon$-approximate Kirszbraun extension of the smoothed dataset, in runtime $O(an+bn(\log n)/\varepsilon^{2})$. For small $a$, an improvement is possible: a data structure can be constructed off-line at a one-time runtime cost of $2^{O(a)}n\log n$ that allows answering (multiple) future prediction queries in time

$$(1/\varepsilon)^{O(a)}\,b\log n\log\log n.$$

In Section 4, we compare the performance of our MWU-based approach to a state-of-the-art interior-point-based solver and report a significant runtime advantage, which allows processing larger samples and ultimately yields greater accuracy.

Finally, for completeness, in Section C, we include a Rademacher-based analysis of the generalization error of our regression algorithm.

2 Formal setup.

Metric space.

A metric space $(X,d_{X})$ is a set $X$ equipped with a symmetric function $d_{X}:X^{2}\to[0,\infty)$ satisfying $d_{X}(x,x^{\prime})=0\iff x=x^{\prime}$ and the triangle inequality. Given two metric spaces $(X,d_{X})$ and $(Y,d_{Y})$ , a function $f:X\to Y$ is $L$ -Lipschitz if $d_{Y}(f(x),f(x^{\prime}))\leq Ld_{X}(x,x^{\prime})$ for all $x,x^{\prime}\in X$ ; its Lipschitz constant $\left\|f\right\|_{\textrm{{\tiny{Lip}}}}$ is the smallest $L$ for which the latter inequality holds. For any metric space $(X,d_{X})$ and $A\subseteq X$ , the following classic Lipschitz extension result, essentially due to McShane (1934); Whitney (1934), holds. If $f:A\to\mathbb{R}$ is Lipschitz (under the inherited metric) then there is an extension $f^{*}:X\to\mathbb{R}$ that coincides with $f$ on $A$ and $\left\|f\right\|_{\textrm{{\tiny{Lip}}}}=\left\|f^{*}\right\|_{\textrm{{\tiny{Lip}}}}$ . A Hilbert space $H$ is a vector space (in our case, over $\mathbb{R}$ ) equipped with an inner product $\langle\cdot,\cdot\rangle:H^{2}\to\mathbb{R}$ , which is a positive-definite symmetric bilinear form; further, $H$ is complete in the metric $d_{H}(x,x^{\prime}):=\sqrt{\langle x-x^{\prime},x-x^{\prime}\rangle}$ .
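For the scalar-valued case, the extension mentioned above can be written explicitly: a standard choice (usually attributed to McShane) is $f^{*}(x)=\min_{a\in A}\,[f(a)+L\,d_{X}(x,a)]$. The sketch below illustrates this formula for a finite set $A\subset\mathbb{R}^{a}$ with the Euclidean metric; the function name `mcshane_extension` and the toy data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mcshane_extension(A, f_A, L):
    """Return x -> min_i (f(a_i) + L * ||x - a_i||), a scalar L-Lipschitz extension.

    A: (n, a) array of points, f_A: (n,) array of values that are L-Lipschitz on A.
    """
    A = np.asarray(A, dtype=float)
    f_A = np.asarray(f_A, dtype=float)

    def f_star(x):
        dists = np.linalg.norm(A - np.asarray(x, dtype=float), axis=1)
        return float(np.min(f_A + L * dists))

    return f_star

# Toy data (hypothetical): three points on the line with 1-Lipschitz values.
A = np.array([[0.0], [1.0], [3.0]])
f_A = np.array([0.0, 1.0, 2.0])
f_star = mcshane_extension(A, f_A, L=1.0)
values_on_A = [f_star(a) for a in A]   # coincides with f_A on A
value_at_2 = f_star([2.0])             # value assigned to the new point x = 2
```

This pointwise-minimum construction relies on the output being real-valued; for vector-valued maps it does not apply coordinate-free, which is where Kirszbraun's theorem comes in.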

Kirszbraun theorem.

Kirszbraun (1934) proved that for two Hilbert spaces $(X,d_{X})$ and $(Y,d_{Y})$ , and $f$ mapping $A\subseteq X$ to $Y$ , there is an extension $f^{*}:X\to Y$ such that $\left\|f\right\|_{\textrm{{\tiny{Lip}}}}=\left\|f^{*}\right\|_{\textrm{{\tiny{Lip}}}}$ . This result is in general false for Banach spaces whose norm is not induced by an inner product (Naor, 2015).

Learning problem.

We assume familiarity with the abstract agnostic learning framework and refer the reader to Mohri et al. (2012) for background. Our approach will be applied to learn a mapping between two Hilbert spaces, $(X,d_{X})$ and $(Y,d_{Y})$. We assume a fixed unknown distribution $P$ on $X\times Y$ and a labeled sample $(x_{i},y_{i})_{i\in[n]}$ of input-output examples. The risk of a given mapping $f:X\to Y$ is defined as $R(f)=\mathbb{E}_{(x,y)\sim P}[d_{Y}(f(x),y)]$; implicit here is our designation of the metric of $Y$ as the loss function. Analogously, the empirical risk of $f$ on a labeled sample is given by $\hat{R}_{n}(f)=n^{-1}\sum_{i\in[n]}d_{Y}(f(x_{i}),y_{i})$. In this paper, we always take $X=\mathbb{R}^{a}$ and $Y=\mathbb{R}^{b}$, each equipped with the standard Euclidean metric. Uniform deviation bounds on $|R(f)-\hat{R}_{n}(f)|$ over all $f$ with $\left\|f\right\|_{\textrm{{\tiny{Lip}}}}\leq L$ are given in Section C.
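As a small illustration of the empirical risk $\hat{R}_{n}(f)$ defined above, the following sketch averages the Euclidean distances between predictions and labels. The function name `empirical_risk`, the predictor `f`, and the toy sample are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def empirical_risk(f, X, Y):
    """Average Euclidean distance between f(x_i) and y_i over the sample."""
    preds = np.array([f(x) for x in X], dtype=float)
    return float(np.mean(np.linalg.norm(preds - Y, axis=1)))

# Toy sample with a = 2 input and b = 1 output dimensions (hypothetical data).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Y = np.array([[0.0], [1.0], [2.0]])

def f(x):
    # Hypothetical predictor: returns the first input coordinate.
    return np.array([x[0]])

risk = empirical_risk(f, X, Y)   # distances are 0, 0, 2, so the mean is 2/3
```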

3 Learning algorithm

Overview.

We follow the basic strategy proposed by Gottlieb et al. (2017) for real-valued regression. We are given a labeled sample $(x_{i},y_{i})_{i\in[n]}$, where $x_{i}\in X:=\mathbb{R}^{a}$ and $y_{i}\in Y:=\mathbb{R}^{b}$. For a user-specified Lipschitz constant $L>0$, we compute the (approximate) Empirical Risk Minimizer (ERM) $\hat{f}:=\operatorname*{argmin}_{f\in F_{L}}\hat{R}_{n}(f)$ over $F_{L}:=\{f\in Y^{X}:\left\|f\right\|_{\textrm{{\tiny{Lip}}}}\leq L\}$. (A standard method for tuning $L$ is via Structural Risk Minimization (SRM): one computes a generalization bound $R(\hat{f})\leq\hat{R}_{n}(\hat{f})+Q_{n}(a,b,L)$, where $Q_{n}(a,b,L):=\sup_{f\in F_{L}}|R(f)-\hat{R}_{n}(f)|=O(L/n^{a+b+1})$, as derived in Section C, and chooses $\hat{L}$ to minimize this. We omit this standard stage of the learning process.)

Predicting the value at a test point $x^{*}\in X$ amounts to Lipschitz-extending $\hat{f}$ from $\left\{x_{i}:i\in[n]\right\}$ to $\left\{x_{i}:i\in[n]\right\}\cup\left\{x^{*}\right\}$ . Equivalently, the ERM stage may be viewed as a smoothing procedure, where $\tilde{y}_{i}:=\hat{f}(x_{i})$ and $(x_{i},\tilde{y}_{i})_{i\in[n]}$ is the smoothed sample — which is then (approximately) Lipschitz-extended to $x^{*}$ . We proceed to describe each stage in detail.

3.1 Approximate Lipschitz extension

Problem statement.

Given a finite sequence $(x_{i})_{i\in[n]}\subset X=\mathbb{R}^{a}$ , its image $(y_{i})_{i\in[n]}\subset Y=\mathbb{R}^{b}$ under some $L$ -Lipschitz map $f:X\to Y$ , a test point $x^{*}$ , and a precision parameter $\varepsilon>0$ , we wish to compute $y^{*}=f(x^{*})$ so that $\left\|y^{*}-f(x_{i})\right\|\leq(1+\varepsilon)L\left\|x^{*}-x_{i}\right\|$ for all $i\in[n]$ . Our first result is an efficient algorithm for achieving this:

Theorem 3.1.

The approximate Lipschitz extension algorithm OnePointExtension has runtime $O(na+nb\log n/\varepsilon^{2})$ .

The query runtime can be significantly improved if the dimension of $X$ is moderate:

Theorem 3.2.

There is a data structure for the Lipschitz extension problem of memory size $O(2^{O(a)}n)$ that can be constructed in time $O(2^{O(a)}n\log n)$ . Given a query point $x^{*}$ and a parameter $\varepsilon\in(0,1/2)$ , one can compute $y^{*}$ such that $\|y^{*}-y_{i}\|\leq(1+\varepsilon)L\|x^{*}-x_{i}\|$ for every $i$ in time $(1/\varepsilon)^{O(a)}b\log n\log\log n$ .

Analysis.

We analyze Algorithm 1 (OnePointExtension) and prove Theorems 3.1 and 3.2 via the multiplicative update framework of Arora et al. (2012). In particular, we will invoke their Theorem 3.4, which, for completeness, is reproduced in Section A as Theorem A.1. To simplify the notation, we assume (without loss of generality) that $L:=\left\|f\right\|_{\textrm{{\tiny{Lip}}}}=1$. Let $x^{\circ}$ be the point closest to $x^{*}$ among $x_{1},\dots,x_{n}$, and let $y^{\circ}=f(x^{\circ})$ be its image. Let $\mathcal{P}=\operatorname{Ball}(y^{\circ},\|x^{\circ}-{x^{*}}\|)$, and define $h_{i}(y)=1-\frac{\|y-y_{i}\|}{\|x^{*}-x_{i}\|}$ for $y\in\mathcal{Y}$ and $i\in\{1,\dots,n\}$. Then the Lipschitz extension problem is equivalent to the following: find $y\in\mathcal{P}$ such that $h_{i}(y)\geq 0$ for all $i\in[n]$. Note that the functions $h_{i}$ are concave and thus the problem is in the form of (3.8) from Arora et al. (2012). We now bound the “width” of the problem, proving that $h_{i}(y)\in[-2,1]$ for every $y\in\mathcal{P}$ (in the notation from Arora et al. (2012), we show that $\ell\leq 1$ and $\rho\leq 2$). Observe that for every $y\in\mathcal{P}$ and every $i$, we have (i) $h_{i}(y)\leq 1$ as $\frac{\|y-y_{i}\|}{\|{x^{*}}-x_{i}\|}\geq 0$ and (ii)

$$1-h_{i}(y)=\frac{\|y-y_{i}\|}{\|x^{*}-x_{i}\|}\leq\frac{\|y-y^{\circ}\|+\|y^{\circ}-y_{i}\|}{\|x^{*}-x_{i}\|}\leq\frac{\|x^{\circ}-x^{*}\|+\|x^{\circ}-x_{i}\|}{\|x^{*}-x_{i}\|}\leq\frac{2\|x^{\circ}-x^{*}\|+\|x^{*}-x_{i}\|}{\|x^{*}-x_{i}\|}\leq 3.$$

Here, we used that $\|y-y^{\circ}\|\leq\|x^{\circ}-{x^{*}}\|$ (which is true since $y\in\mathcal{P}$ ), $\|y^{\circ}-y_{i}\|\leq\|x^{\circ}-x_{i}\|$ (which is true since $f$ is 1-Lipschitz), and $\|{x^{*}}-x^{\circ}\|\leq\|{x^{*}}-x_{i}\|$ (which is true since $x^{\circ}$ is the point closest to ${x^{*}}$ among all points $x_{1},\dots,x_{n}$ ). We conclude that $h_{i}(y)\in[-2,1]$ .

To apply Theorem A.1, we design an oracle for the following problem:

Problem 3.3.

Given non-negative weights $w_{i}$ that add up to $1$ , find $y\in\mathcal{P}$ such that

$$\sum_{i=1}^{n}w_{i}h_{i}(y)\geq 0.\tag{1}$$

Note that Problem 3.3 has a solution, since ${y^{*}}$, the Lipschitz extension of $f$ to ${x^{*}}$ (whose existence is guaranteed by the Kirszbraun theorem), satisfies (1). Define auxiliary weights $p_{i}$ and $q_{i}$ as follows:

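$$P=\sum_{i=1}^{n}\frac{w_{i}}{\|x^{*}-x_{i}\|^{2}}\quad\text{and}\quad p_{i}=\frac{w_{i}}{P\|x^{*}-x_{i}\|^{2}},$$

$$Q=\sum_{i=1}^{n}\frac{w_{i}}{\|x^{*}-x_{i}\|}\quad\text{and}\quad q_{i}=\frac{w_{i}}{Q\|x^{*}-x_{i}\|}.$$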

The oracle finds and outputs $z\in\mathcal{P}$ that minimizes $V(z)=\sum_{i=1}^{n}p_{i}\|z-y_{i}\|^{2}$. To this end, it first computes $z_{0}=\sum_{i=1}^{n}p_{i}y_{i}$. Note that $V(z)=\|z-z_{0}\|^{2}+\sum_{i=1}^{n}p_{i}\|z_{0}-y_{i}\|^{2}$. Then, if $z_{0}\in\mathcal{P}$, it sets $z=z_{0}$; otherwise, $z$ is set to be the point closest to $z_{0}$ in $\mathcal{P}$, which is

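$$z=\frac{\|x^{\circ}-x^{*}\|}{\|z_{0}-y^{\circ}\|}z_{0}+\left(1-\frac{\|x^{\circ}-x^{*}\|}{\|z_{0}-y^{\circ}\|}\right)y^{\circ}.$$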

This $z$ is computed on lines 6–8 of the algorithm. We verify that $z$ satisfies condition (1). Rewrite condition (1) in terms of weights $q_{i}$ : $Q\sum_{i=1}^{n}q_{i}\,\|y-y_{i}\|\leq 1$ . Using that

$$\frac{\|y^{*}-y_{i}\|^{2}}{\|x^{*}-x_{i}\|^{2}}\leq 1,$$

we get

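$$Q\sum_{i=1}^{n}q_{i}\|z-y_{i}\|\leq\left(\sum_{i=1}^{n}w_{i}\frac{\|z-y_{i}\|^{2}}{\|x^{*}-x_{i}\|^{2}}\sum_{i=1}^{n}w_{i}\right)^{1/2}\leq\left(\sum_{i=1}^{n}w_{i}\frac{\|y^{*}-y_{i}\|^{2}}{\|x^{*}-x_{i}\|^{2}}\right)^{1/2}\leq 1.$$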

The first inequality is due to Cauchy–Schwarz, and the second holds since $V(z)\leq V(y^{*})$ .

Proof of Theorem 3.1. From Theorem A.1, we get that the algorithm finds a $(1+\varepsilon)$-approximate solution in $T=\frac{8\rho\ell\ln m}{\varepsilon^{2}}=\frac{16\ln m}{\varepsilon^{2}}$ iterations (here $m=n$ is the number of constraints). Computing the distances $d_{i}$ takes $O(an)$ time, and each iteration takes $O(bn)$ time. ∎
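The algorithm listing itself is not reproduced in this text, so the following is only a sketch of how the pieces of the analysis (weights $w_{i}$, the oracle with auxiliary weights $p_{i}$, projection onto the ball $\mathcal{P}$, and averaging of the oracle outputs) might fit together. The function name `one_point_extension`, the step size $\eta=\varepsilon/4$, and the update $w_{i}\leftarrow w_{i}(1-\eta h_{i}/\rho)$ are assumptions: they are one standard way to instantiate multiplicative weights and need not match the authors' MATLAB implementation.

```python
import numpy as np

def one_point_extension(X, Y, x_star, L=1.0, eps=0.1):
    """Sketch of an MWU-based approximate Lipschitz extension to one point.

    X: (n, a) training inputs, Y: (n, b) their images under an L-Lipschitz map.
    Returns y with ||y - y_i|| <= (1 + eps) * L * ||x_star - x_i|| (up to rounding).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_star = np.asarray(x_star, dtype=float)
    n = X.shape[0]
    radius = L * np.linalg.norm(X - x_star, axis=1)   # L * ||x* - x_i||
    near = int(np.argmin(radius))
    if radius[near] == 0.0:                           # x* is a training point
        return Y[near].copy()
    y_near, r_near = Y[near], radius[near]            # P = Ball(y_near, r_near)

    rho = 2.0                                         # width: h_i(y) in [-2, 1] on P
    eta = eps / 4.0                                   # assumed step size
    T = int(np.ceil(16.0 * np.log(max(n, 2)) / eps ** 2))
    w = np.full(n, 1.0 / n)
    z_sum = np.zeros(Y.shape[1])
    for _ in range(T):
        # Oracle: minimize V(z) = sum_i p_i ||z - y_i||^2 over the ball P.
        p = w / radius ** 2
        p /= p.sum()
        z = p @ Y                                     # unconstrained minimizer z_0
        gap = np.linalg.norm(z - y_near)
        if gap > r_near:                              # project z_0 onto P
            z = y_near + (r_near / gap) * (z - y_near)
        z_sum += z
        # Multiplicative update: satisfied constraints lose weight.
        h = 1.0 - np.linalg.norm(Y - z, axis=1) / radius
        w *= 1.0 - eta * h / rho
        w /= w.sum()
    return z_sum / T                                  # average of oracle outputs

# Toy usage (hypothetical data): the identity map on four points is 1-Lipschitz.
X_train = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
Y_train = X_train.copy()
x_query = np.array([1.0, 0.5])
y_pred = one_point_extension(X_train, Y_train, x_query, L=1.0, eps=0.2)
```

Returning the average of the oracle outputs, rather than the last one, is what lets concavity of the $h_{i}$ turn the average guarantee over rounds into a guarantee for a single point.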

Proof of Theorem 3.2 (sketch). Our key observation is that we can run the algorithm from Theorem 3.1 on a subset $X^{\prime}$ of $X$ that is sufficiently dense in $X$. Specifically, let $x^{\circ}$ be a $(1+\varepsilon)$-approximate nearest neighbor for $x^{*}$ in $X$. Assume that a subset $X^{\prime}\subset X$ contains $x^{\circ}$ and satisfies the following property: for every $x_{i}\in X\cap\operatorname{Ball}(x^{*},\|x^{*}-x^{\circ}\|/\varepsilon)$, there exists $x_{j}\in X^{\prime}$ such that $\|x_{j}-x_{i}\|\leq\varepsilon\|x^{*}-x_{i}\|$.

First, we will prove that by running the algorithm on set $X^{\prime}$ we get $y^{*}$ such that $\|y_{i}-y^{*}\|\leq(1+O(\varepsilon))L\|x_{i}-x^{*}\|$ for all $i$ . Then we describe a data structure that we use to find $X^{\prime}$ for a given query point $x^{*}$ in time $(1/\varepsilon)^{O(a)}\log n$ .

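(1) The algorithm from Theorem 3.1 finds $y^{*}$ such that $\|y_{i}-y^{*}\|\leq(1+O(\varepsilon))L\|x_{i}-x^{*}\|$ for all $x_{i}\in X^{\prime}$. Consider $x_{i}\in X$. First, assume that $x_{i}\in\operatorname{Ball}(x^{*},\|x^{*}-x^{\circ}\|/\varepsilon)$. Find $x_{j}\in X^{\prime}$ such that $\|x_{j}-x_{i}\|\leq\varepsilon\|x^{*}-x_{i}\|$. Then

$$\|y_{i}-y^{*}\|\leq L\|x_{i}-x_{j}\|+(1+\varepsilon)L\|x_{j}-x^{*}\|\leq L\bigl((2+\varepsilon)\|x_{i}-x_{j}\|+(1+\varepsilon)\|x_{i}-x^{*}\|\bigr)\leq(1+3\varepsilon+\varepsilon^{2})L\|x_{i}-x^{*}\|,$$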

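as required. Now assume that $x_{i}\notin\operatorname{Ball}(x^{*},\|x^{*}-x^{\circ}\|/\varepsilon)$, so that $\|x^{\circ}-x^{*}\|\leq\varepsilon\|x_{i}-x^{*}\|$. Then

$$\|y_{i}-y^{*}\|\leq L\|x_{i}-x^{\circ}\|+(1+\varepsilon)L\|x^{\circ}-x^{*}\|\leq L\bigl(\|x_{i}-x^{*}\|+(2+\varepsilon)\|x^{\circ}-x^{*}\|\bigr)\leq(1+\varepsilon)^{2}L\|x_{i}-x^{*}\|.$$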

(2) We use a data structure $\cal D$ for approximate nearest neighbor search in $X$. We employ one of the constructions for low-dimensional Euclidean spaces, by either Arya et al. (1994) or Har-Peled and Mendel (2006). Using $\cal D$, we can find a $(1+\varepsilon/3)$-approximate nearest neighbor of a point in $\mathbb{R}^{a}$ in time $(1/\varepsilon)^{O(a)}\log n$. Recall that we can construct $\cal D$ in $O(2^{O(a)}n\log n)$ time, and it requires $O(2^{O(a)}n\log n)$ space. Suppose that we get a query point $x^{*}$. We first find an approximate nearest neighbor $x^{\circ}$ for $x^{*}$. Let $r=\|x^{\circ}-x^{*}\|$. Take an $\varepsilon r/3$-net $N^{\prime}$ in the ball $\operatorname{Ball}(x^{*},r/\varepsilon)$. For every point $p\in N^{\prime}$, we find an approximate nearest neighbor $x(p)$ in $X$ (using $\cal D$). Let $X^{\prime}=\{x(p):p\in N^{\prime}\}\cup\{x^{\circ}\}$. Consider $x_{i}\in\operatorname{Ball}(x^{*},r/\varepsilon)$. There is $p\in N^{\prime}$ at distance at most $\varepsilon r/3$ from $x_{i}$. Let $x_{j}=x(p)\in X^{\prime}$. Then

$$\|p-x_{j}\|\leq(1+\varepsilon/3)\|x_{i}-p\|\leq(1+\varepsilon/3)\varepsilon r/3$$

and

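$$\|x_{i}-x_{j}\|\leq\varepsilon r/3+(1+\varepsilon/3)\varepsilon r/3\leq(2+\varepsilon/3)\varepsilon r/3\leq\varepsilon\|x^{*}-x_{i}\|,$$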

as required. The size of $X^{\prime}$ is at most the size of $N^{\prime}$, which is $(1/\varepsilon)^{O(a)}$. ∎

Multi-point Lipschitz extension.

Finally, we describe an algorithm for the Multi-point Lipschitz Extension problem, which generalizes the problem we studied in Section 3.1. We are given a set of points $X=\{x_{1},\dots,x_{n}\}\subset\mathbb{R}^{a}$ and their images $Y=\{y_{1},\dots,y_{n}\}\subset\mathbb{R}^{b}$ under an $L$-Lipschitz map $f$. Additionally, we are given a set $Z=\{x_{n+1},\dots,x_{n+n^{\prime}}\}\subset\mathbb{R}^{a}$ and a set of edges $E$ on $\{1,\dots,n+n^{\prime}\}$. We need to extend $f$ to $Z$ (that is, find $y_{n+1},\dots,y_{n+n^{\prime}}$) such that $\|y_{i}-y_{j}\|\leq(1+\varepsilon)L\|x_{i}-x_{j}\|$ for $(i,j)\in E$. We note that $E$ may contain edges that impose Lipschitz constraints (i) between points in $X$ and $Z$ and (ii) between pairs of points in $Z$. Without loss of generality, we assume that there are no edges $(i,j)\in E$ with $1\leq i,j\leq n$.

Theorem 3.4.

There is an algorithm for the Multi-point Lipschitz Extension problem that runs in time

$$O\left(ma+\frac{m^{3/2}(\log m)^{2}\,b\log(1/\varepsilon)}{\varepsilon^{5/2}}\right),$$

where $m=|E|$ .

The algorithm and its analysis are almost identical to those for the Lipschitz Smoothing problem (see Theorem 3.5).

3.2 Smoothing

Problem statement.

We reformulate the ERM problem $\hat{f}=\operatorname*{argmin}_{f\in F_{L}}\hat{R}_{n}(f)$ as follows. Given two sets of vectors, $(x_{i},y_{i})_{i\in[n]}$ , where $x_{i}\in X:=\mathbb{R}^{a}$ and $y_{i}\in Y:=\mathbb{R}^{b}$ , we wish to compute a “smoothed” version $\tilde{y}_{i}$ of the $y_{i}$ ’s so as to

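$$\mathrm{Minimize}\quad\Phi(\mathbf{Y},\widetilde{\mathbf{Y}})\quad\mathrm{subject\ to}\quad\|\tilde{y}_{i}-\tilde{y}_{j}\|\leq L\|x_{i}-x_{j}\|,\quad i,j\in[n].$$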

$\Phi(\mathbf{Y},\widetilde{\mathbf{Y}}):=\sum_{i=1}^{n}\|y_{i}-\tilde{y}_{i}\|^{2}$ is the distortion, and $\|\tilde{y}_{i}-\tilde{y}_{j}\|\leq L\|x_{i}-x_{j}\|$ for all $i,j\in[n]$ are the Lipschitz constraints. Here, $\mathbf{Y}=(y_{1},\dots,y_{n})$ and $\widetilde{\mathbf{Y}}=(\tilde{y}_{1},\dots,\tilde{y}_{n})$ (the columns of matrices $\mathbf{Y}$ and $\widetilde{\mathbf{Y}}$ are vectors $y_{1},\dots,y_{n}$ and $\tilde{y}_{1},\dots,\tilde{y}_{n}$ , respectively). Notice that when we use the $L_{2}$ norm, this problem is a quadratically constrained quadratic program (QCQP).
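To make the objective and constraints concrete, the sketch below evaluates the distortion $\Phi$ and the largest ratio $\|\tilde{y}_{i}-\tilde{y}_{j}\|/\|x_{i}-x_{j}\|$ over all pairs for a candidate smoothing. It only checks a given candidate and does not solve the optimization problem. The helper names and toy data are hypothetical, and points are stored as rows here (the text stores them as columns).

```python
import numpy as np
from itertools import combinations

def distortion(Y, Y_tilde):
    """Phi(Y, Y_tilde): sum of squared distances between labels and smoothed labels."""
    return float(np.sum((np.asarray(Y) - np.asarray(Y_tilde)) ** 2))

def max_lipschitz_ratio(X, Y_tilde):
    """Largest ||y~_i - y~_j|| / ||x_i - x_j|| over all pairs i < j (distinct x_i assumed)."""
    n = len(X)
    return max(
        np.linalg.norm(Y_tilde[i] - Y_tilde[j]) / np.linalg.norm(X[i] - X[j])
        for i, j in combinations(range(n), 2)
    )

# Toy example (hypothetical): a spike at x = 1 violates L = 1 and is flattened.
X = np.array([[0.0], [1.0], [2.0]])
Y = np.array([[0.0], [3.0], [0.0]])
Y_tilde = np.array([[1.0], [2.0], [1.0]])   # one feasible candidate for L = 1

phi = distortion(Y, Y_tilde)                # 1 + 1 + 1 = 3
ratio = max_lipschitz_ratio(X, Y_tilde)     # 1.0, so the L = 1 constraints hold
```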

We consider a more general variant of this problem where we are given a set of edges $E$ on $\{1,\dots,n\}$, and the goal is to ensure that the Lipschitz constraints $\|\tilde{y}_{i}-\tilde{y}_{j}\|\leq L\|x_{i}-x_{j}\|$ hold (only) for $(i,j)\in E$. The original problem corresponds to the case when $E$ is the complete graph (with $E_{ij}=L\|x_{i}-x_{j}\|$). Importantly, if the doubling dimension $\operatorname{ddim}X$ is low, we can solve the original problem by letting $([n],E)$ be a $(1+\varepsilon)$-stretch spanner; then $m=n(1/\varepsilon)^{O(\operatorname{ddim})}$ (this approach was previously used by Gottlieb et al. (2017); see also Har-Peled and Mendel (2006, Section 8.2), who used a similar approach to compute the doubling constant). Our algorithm for Lipschitz Smoothing iteratively solves Laplace’s problem in the graph $G=([n],E)$. We proceed to define this problem and present a closed-form formula for the solution.

Laplace’s problem.

Given vectors $\{y_{i}\}$, a graph $G$, vertex weights $\lambda_{i}\geq 0$ (for $i\in[n]$), and edge weights $\mu_{ij}\geq 0$ (for $(i,j)\in E$), find $\tilde{y}_{i}$ so as to

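$$\text{minimize}\quad\Psi(\mathbf{Y},\widetilde{\mathbf{Y}},\{\lambda_{i}\},\{\mu_{ij}\})\equiv\sum_{i=1}^{n}\lambda_{i}\|y_{i}-\tilde{y}_{i}\|^{2}+\sum_{(i,j)\in E}\mu_{ij}\|\tilde{y}_{i}-\tilde{y}_{j}\|^{2}.$$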

Let $\mathcal{L}$ be the Laplacian of $G=([n],E)$ with edge weights $\mu_{ij}$; that is, $\mathcal{L}_{ii}=\sum_{j:j\neq i}\mu_{ij}$ and $\mathcal{L}_{ij}=-\mu_{ij}$ for $i\neq j$. Let $\Lambda=\operatorname{diag}(\lambda_{1},\dots,\lambda_{n})$. Then the optimal $\widetilde{\mathbf{Y}}$ satisfies

$$(\mathcal{L}+\Lambda)\widetilde{\mathbf{Y}}^{\top}=\Lambda\mathbf{Y}^{\top}.$$

This equation can be solved separately for each of the $b$ columns of $\widetilde{\mathbf{Y}}^{\top}$ using the nearly-linear time equation solver for diagonally dominant matrices by Koutis et al. (2012) in total time $O(bm\log n\log(1/\varepsilon))$ (see also the paper by Spielman and Teng (2004), which presented the first nearly-linear time solver for diagonally dominant matrices).
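The linear system for Laplace’s problem can be illustrated on a small instance. The sketch below builds the weighted Laplacian and solves the system with a dense solver (`numpy.linalg.solve`); this is a stand-in for the nearly-linear time solvers cited above, which are not implemented here. The function name `solve_laplace_problem` and the toy data are assumptions, labels are stored as rows, and all $\lambda_{i}$ are assumed positive so that the matrix is invertible.

```python
import numpy as np

def solve_laplace_problem(Y, edges, lam, mu):
    """Minimize sum_i lam_i ||y_i - y~_i||^2 + sum_(i,j) mu_ij ||y~_i - y~_j||^2.

    Y: (n, b) labels as rows; edges: list of (i, j); lam: (n,) positive vertex
    weights; mu: edge weights aligned with `edges`. Returns the (n, b) minimizer.
    """
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    lap = np.zeros((n, n))
    for (i, j), m_ij in zip(edges, mu):
        lap[i, i] += m_ij
        lap[j, j] += m_ij
        lap[i, j] -= m_ij
        lap[j, i] -= m_ij
    Lam = np.diag(np.asarray(lam, dtype=float))
    # Stationarity condition: (Laplacian + Lambda) Y~ = Lambda Y, all b columns at once.
    return np.linalg.solve(lap + Lam, Lam @ Y)

# Toy instance (hypothetical): two points joined by one edge, unit weights.
Y = np.array([[0.0], [2.0]])
Y_smooth = solve_laplace_problem(Y, edges=[(0, 1)], lam=[1.0, 1.0], mu=[1.0])
# Solving 2a - b = 0 and -a + 2b = 2 by hand gives a = 2/3, b = 4/3.
```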

We solve the Lipschitz Smoothing problem via the multiplicative weight update algorithm LipschitzSmooth, presented below. It was inspired by the algorithm for finding maximum flow using electrical networks by Christiano et al. (2011).

Analysis.

Let $\widetilde{\mathbf{Y}}^{*}$ be the optimal solution to the Lipschitz Smoothing problem and $\Phi_{0}$ be a $(1+\varepsilon)$-approximation to the optimal value; that is,

[TABLE]

(we assume that $\Phi_{0}$ is given to the algorithm; note that $\Phi_{0}$ can be found by binary search).

As in Section 3.1, we use the multiplicative-weight update (MWU) method. Let

[TABLE]

Note that functions $h_{\Phi}$ and $h_{ij}$ are concave.

Observe that $h_{\Phi}(\widetilde{\mathbf{Y}}^{*})\geq 0$ and $h_{ij}(\widetilde{\mathbf{Y}}^{*})\geq 0$ for every $(i,j)\in E$ . On the other hand, if $h_{\Phi}(\widetilde{\mathbf{Y}})\geq-\varepsilon$ and $h_{ij}(\widetilde{\mathbf{Y}})\geq-\varepsilon$ , then

[TABLE]

and $\|\tilde{y}_{i}-\tilde{y}_{j}\|\leq(1+\varepsilon)L\|x_{i}-x_{j}\|$ for every $(i,j)\in E$ .

In the Appendix, we describe the approximation oracle that we invoke in the MWU method.

Theorem 3.5.

There is an algorithm for the Lipschitz Smoothing problem that runs in time

[TABLE]

where $m=|E|$ .

Proof.

From Theorem 3.5 in Arora et al. (2012), we get that the algorithm finds an $O(\varepsilon)$-approximate solution in $T=O\left(\frac{\sqrt{m/\varepsilon}\ln m}{\varepsilon^{2}}\right)=\widetilde{O}\left(\frac{\sqrt{m}}{\varepsilon^{5/2}}\right)$ iterations. Each iteration takes $O(bm\log n\log(1/\varepsilon))$ time (which is dominated by the time necessary to solve Laplace’s problem); additionally, we spend time $O(am)$ to compute pairwise distances between points in $X$. ∎

4 Experiments

To illustrate the utility of our framework, we designed two simple non-linear transformation problems where the input and output are both scalars. The inputs were generated uniformly at random over $[-2\pi,2\pi]$, and we evaluated the performance on two target functions: $f(x)=x^{3}$ and $f(x)=\sin(x)$.
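A data set of this kind can be sketched as follows. This is a Python illustration, not the authors’ Matlab code: the helper name `make_dataset`, the use of NumPy’s generator, and the absence of label noise are assumptions, since the text specifies only the input distribution and the two target functions.

```python
import numpy as np

def make_dataset(f, n_points, seed=1):
    """Draw inputs uniformly from [-2*pi, 2*pi] and label them with f (no noise assumed)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2 * np.pi, 2 * np.pi, size=n_points)
    return x, f(x)

# The two target functions named in the text.
targets = {"cubic": lambda x: x ** 3, "sine": np.sin}

x_train, y_train = make_dataset(targets["cubic"], 100)
# A different seed for the test set is an assumption made here for illustration.
x_test, y_test = make_dataset(targets["cubic"], 100, seed=2)
```

Note that a NumPy seed of 1 does not reproduce the numbers drawn by Matlab with the same seed; the sketch only mirrors the sampling scheme.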

Results.

In order to perform a fair, apples-to-apples comparison, we implemented both Algorithms 3 and 1 in Matlab, which provides standard, optimized QCQP solvers, and solved the regression problem via the Kirszbraun extension technique. We compared the results of this learning method when using our methods for the optimization problems (MWU) versus Matlab’s QCQP solver based on the interior-point algorithm (IntPt). We used the squared Euclidean distance as the loss function. We ran several tests with training sets of 20, 100, 200, 500, and 1000 random points, and 100 test points in all experiments. For reproducibility, we used Matlab’s random seed 1 in all our runs. All the tests were conducted on the same MacBook Pro computer.

The numerical comparison (Tables 1-5) shows a clear advantage of MWU over IntPt in both efficiency and learning quality. The MWU method is able to optimize over a data set of several thousand points, while the IntPt-based method could not complete in “reasonable” time (an overnight run of over 10 hours) with more than $N=200$ training points. In terms of solving the learning problem, MWU is able to solve the QCQP and produces accurate smoothing and increasingly accurate extension functions as the data size grows. The IntPt method, on the other hand, is able to meet all the constraints of the problem only on very small data sets (fewer than 50 training points), which is insufficient data for learning. Tables 1-5 show that for 20 training points, IntPt trains in 2.474 seconds and completely overfits the data set with 0 ERM, which, as expected given the size of the data, leads to very poor generalization (Table 5). On larger data sets the IntPt optimization fails to satisfy all of the constraints. This results in several “heavy” outliers, which strongly affect the average squared error of both the smoothing and extension phases, as can be seen in Tables 1-5. Table 6 shows a graphical comparison of both implementations when the training set has $N=100$ points. The “heavy outliers” can be spotted easily on the graph, and explain why the same learning algorithm performs so differently when optimised with two different methods.

Tables 1-6 summarise the results for $f(x)=x^{3}$. The results for $f(x)=\sin(x)$ show the same basic pattern and are given in the appendix. The blank entries in the tables indicate that the process did not terminate in the time allotted (12 hours).

5 Discussion and Conclusions

This work introduces a framework for performing regression between two Hilbert spaces based on Kirszbraun’s extension theorem, along with a statistical analysis of this method. The task is decomposed into two stages: smoothing (which corresponds to training) and prediction (which is achieved via Kirszbraun extension). Solving our optimization problems numerically revealed the need for a more efficient solver than off-the-shelf state-of-the-art solvers. We introduced two optimization algorithms, one for the smoothing problem and one for the extension; both are based on novel MWU schemes. Both analysis and experiments show a dramatic runtime improvement for both optimization problems, indicating that these algorithms are the main contribution of this work and are an interesting topic for future research in their own right. Our code is also provided for reproducibility and to facilitate usage.

Acknowledgements.

AK was partially supported by the Israel Science Foundation (grant No. 1602/19), the Ben-Gurion University Data Science Research Center, and an Amazon Research Award. HZ was an MSc student at Ben-Gurion University of the Negev during part of this research.

Appendix A The Arora-Hazan-Kale result

For completeness, we quote here verbatim (except for the numbering) the relevant definitions and results from (Arora et al., 2012, Sec. 3.3.1, p. 137).

Imagine that we have the following feasibility problem:

[TABLE]

where $\mathcal{P}\subseteq\mathbb{R}^{n}$ is a convex domain, and for $i\in[m]$, $f_{i}:\mathcal{P}\rightarrow\mathbb{R}$ are concave functions. We wish to satisfy this system approximately, up to an additive error of $\varepsilon$. We assume the existence of an Oracle, which, when given a probability distribution $\mathbf{p}=(p_{1},p_{2},\ldots,p_{m})$, solves the following feasibility problem:

[TABLE]

An Oracle is said to be $(\ell,\rho)$-bounded if there is a fixed subset of constraints $I\subseteq[m]$ such that whenever it returns a feasible solution $\mathbf{x}$ to (3), all constraints $i\in I$ take values in the range $[-\ell,\rho]$ on the point $\mathbf{x}$, and all the rest take values in $[-\rho,\ell]$.

Theorem A.1 (Theorem 3.4 in Arora et al. (2012)).

Let $\varepsilon>0$ be a given error parameter. Suppose there exists an $(\ell,\rho)$-bounded Oracle for the feasibility problem (2). Assume that $\ell\geq\varepsilon/2$. Then there is an algorithm which either solves the problem up to an additive error of $\varepsilon$, or correctly concludes that the system is infeasible, making only $O(\ell\rho\log(m)/\varepsilon^{2})$ calls to the Oracle, with an additional processing time of $O(m)$ per call.
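The following sketch illustrates the multiplicative-weights loop behind results of this type. It is a simplified variant, not the algorithm of Arora et al. (2012): it assumes the symmetric-width case where every $f_{i}$ takes values in $[-\rho,\rho]$ on Oracle outputs, and it uses the step size $\eta=\varepsilon/(2\rho)$ and $T=\lceil 4\rho^{2}\ln m/\varepsilon^{2}\rceil$ rounds, one standard choice of constants that gives $O(\rho^{2}\log m/\varepsilon^{2})$ Oracle calls rather than the sharper $O(\ell\rho\log m/\varepsilon^{2})$. The function name and the toy constraint system are hypothetical.

```python
import math
import numpy as np

def mwu_feasibility(constraints, oracle, rho, eps):
    """Search for x with f_i(x) >= -eps for all i (concave f_i, |f_i| <= rho).

    oracle(p) must return x in P with sum_i p_i * f_i(x) >= 0, or None if
    no such x exists. Requires eps <= rho and at least two constraints.
    """
    m = len(constraints)
    eta = eps / (2.0 * rho)
    rounds = math.ceil(4.0 * rho ** 2 * math.log(m) / eps ** 2)
    w = np.ones(m)
    xs = []
    for _ in range(rounds):
        p = w / w.sum()
        x = oracle(p)
        if x is None:
            return None  # weighted constraint unsatisfiable, so the system is infeasible
        xs.append(x)
        vals = np.array([f(x) for f in constraints])
        w *= 1.0 - eta * vals / rho  # well-satisfied constraints lose weight
    # By concavity, f_i at the average is at least the average of f_i.
    return np.mean(xs, axis=0)

# Toy system on P = [0, 1]^2: x0 + x1 - 1 >= 0 and 1 - x0 - x1 >= 0.
constraints = [lambda x: x[0] + x[1] - 1.0, lambda x: 1.0 - x[0] - x[1]]

def oracle(p):
    # Maximizes (p[0] - p[1]) * (x0 + x1 - 1) over the corners of P.
    return np.array([1.0, 1.0]) if p[0] >= p[1] else np.array([0.0, 0.0])

x_bar = mwu_feasibility(constraints, oracle, rho=1.0, eps=0.1)
```

In the toy system no single Oracle answer satisfies both constraints, but the average of the answers does so up to the additive error, which is the mechanism the theorem relies on.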

Appendix B Approximate oracle

To use the MWU method (see Theorem 3.5 in Arora et al. (2012)), we design an approximate oracle for the following problem.

Problem B.1.

Given non-negative edge weights $w_{\Phi}$ and $w_{ij}$ , which add up to 1, find $\widetilde{\mathbf{Y}}$ such that

[TABLE]

Let $\mu_{ij}=\frac{w_{ij}+\varepsilon/(m+1)}{M^{2}\|x_{i}-x_{j}\|^{2}}$ and $\lambda_{i}=\lambda=(w_{\Phi}+\varepsilon/(m+1))/\Phi_{0}$ . We solve Laplace’s problem with parameters $\mu_{ij}$ and $\lambda_{i}$ (see Section 3.2 and Line 9 of the algorithm). We get a matrix $\widetilde{\mathbf{Y}}=(\tilde{y}_{1},\dots,\tilde{y}_{n})$ minimizing

[TABLE]
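The parameter choice above translates directly into code. The sketch below computes $\mu_{ij}$ and $\lambda$ from the MWU weights exactly as defined in the text; the function name, argument order, and row-wise storage of the points $x_{i}$ are illustrative assumptions.

```python
import numpy as np

def oracle_parameters(X, edges, w_phi, w_edges, M, eps, phi0):
    """Return (mu, lam) for Laplace's problem from the MWU weights.

    mu_ij = (w_ij + eps/(m+1)) / (M^2 * ||x_i - x_j||^2)
    lam   = (w_phi + eps/(m+1)) / phi0   (the same value for every vertex)
    """
    m = len(edges)
    pad = eps / (m + 1)  # keeps every weight strictly positive
    sq_dists = np.array([np.sum((X[i] - X[j]) ** 2) for i, j in edges])
    mu = (np.asarray(w_edges, dtype=float) + pad) / (M ** 2 * sq_dists)
    lam = (w_phi + pad) / phi0
    return mu, lam
```

The additive term $\varepsilon/(m+1)$ ensures that no constraint is ever given zero weight, which is what makes the width bounds later in this appendix possible.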

Consider the optimal solution $\tilde{y}_{1}^{*},\dots,\tilde{y}_{n}^{*}$ for Lipschitz Smoothing. We have

[TABLE]

We verify that $\widetilde{\mathbf{Y}}$ is a feasible solution for Problem B.1. We have

[TABLE]

as required.

Finally, we bound the width of the problem. We have $h_{\Phi}(\widetilde{\mathbf{Y}})\leq 1$ and $h_{ij}(\widetilde{\mathbf{Y}})\leq 1$ . Then, using (B), we get

[TABLE]

Therefore, $-h_{\Phi}(\widetilde{\mathbf{Y}})\leq O(\sqrt{m/\varepsilon})$ .

Similarly,

[TABLE]

Therefore, $-h_{ij}(\widetilde{\mathbf{Y}})\leq O(\sqrt{m/\varepsilon})$ .
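These width bounds can be combined with Theorem A.1 to recover the iteration count used in the proof of Theorem 3.5. The following is a short check rather than a new result; it assumes the Oracle is $(\ell,\rho)$-bounded with $\ell=1$ (from $h_{\Phi},h_{ij}\leq 1$), $\rho=O(\sqrt{m/\varepsilon})$, and $m+1$ constraints in total:

$$
T=O\!\left(\frac{\ell\rho\log(m+1)}{\varepsilon^{2}}\right)
=O\!\left(\frac{\sqrt{m/\varepsilon}\,\log m}{\varepsilon^{2}}\right)
=O\!\left(\frac{\sqrt{m}\,\log m}{\varepsilon^{5/2}}\right).
$$

Multiplying by the per-iteration cost $O(bm\log n\log(1/\varepsilon))$ and adding the $O(am)$ cost of computing distances gives a total of

$$
O\!\left(\frac{b\,m^{3/2}\log m\,\log n\,\log(1/\varepsilon)}{\varepsilon^{5/2}}+am\right).
$$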

Appendix C Generalization bounds

Let $\mathcal{X}\subset\mathbb{R}^{k}$ and $\mathcal{Y}\subset\mathbb{R}^{\ell}$ be the unit balls of their respective Hilbert spaces (each endowed with the $\ell_{2}$ norm $||\cdot||$ and corresponding inner product) and $\mathcal{H}_{L}\subset\mathcal{Y}^{\mathcal{X}}$ be the set of all $L$ -Lipschitz mappings from $\mathcal{X}$ to $\mathcal{Y}$ . In particular, every $h\in\mathcal{H}_{L}$ satisfies

[TABLE]

Let $\mathcal{F}_{L}\subset\mathbb{R}^{\mathcal{X}\times\mathcal{Y}}$ be the loss class associated with $\mathcal{H}_{L}$ :

[TABLE]

In particular, every $f\in\mathcal{F}_{L}$ satisfies $0\leq f\leq 2$ .

Our goal is to bound the Rademacher complexity of $\mathcal{F}_{L}$ . We do this via a covering numbers approach.

The empirical Rademacher complexity of a collection of functions $\mathcal{F}$ mapping some set $\mathcal{Z}$ to $\mathbb{R}$, with respect to a sample $(Z_{1},\dots,Z_{n})\in\mathcal{Z}^{n}$, is defined by:

[TABLE]
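For intuition, the empirical Rademacher complexity of a small finite class can be computed exactly by enumerating all sign vectors. The sketch assumes the standard definition $\hat{R}_{n}(\mathcal{F})=\mathbb{E}_{\sigma}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(Z_{i})$ (without an absolute value inside the supremum); the function name and the input layout are illustrative.

```python
import itertools
import numpy as np

def empirical_rademacher(values):
    """Exact empirical Rademacher complexity of a finite function class.

    values[k, i] = f_k(Z_i). Averages sup_k (1/n) sum_i sigma_i f_k(Z_i)
    over all 2^n sign vectors, so it is only practical for small n.
    """
    values = np.asarray(values, dtype=float)
    n = values.shape[1]
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        total += np.max(values @ np.array(signs)) / n
    return total / 2 ** n
```

For a class containing a single function the result is zero, since the signs average out; richer classes can correlate with more sign patterns and therefore score higher.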

Recall the relevance of Rademacher complexities to uniform deviation estimates for the risk functional $R(\cdot)$ (Mohri et al., 2012, Theorem 3.1): for every $\delta>0$ , with probability at least $1-\delta$ , for each $h\in\mathcal{H}_{L}$ :

[TABLE]

Define $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ and endow it with the norm $\left\|(x,y)\right\|_{\mathcal{Z}}=\left\|x\right\|+\left\|y\right\|$ ; note that $(\mathcal{Z},\left\|\cdot\right\|_{\mathcal{Z}})$ is a Banach but not a Hilbert space. First, we observe that the functions in $\mathcal{F}_{L}$ are Lipschitz under $\left\|\cdot\right\|_{\mathcal{Z}}$ . Indeed, choose any $f=f_{h}\in\mathcal{F}_{L}$ and $x,x^{\prime}\in\mathcal{X}$ , $y,y^{\prime}\in\mathcal{Y}$ . Then

[TABLE]

where $a\vee b:=\max\left\{a,b\right\}$ . We conclude that any $f\in\mathcal{F}_{L}$ is $(L\vee 1)<L+1$ -Lipschitz under $\left\|\cdot\right\|_{\mathcal{Z}}$ .

Since we restricted the domain and range of $\mathcal{H}_{L}$ , respectively, to the unit balls $B_{\mathcal{X}}$ and $B_{\mathcal{Y}}$ , the domain of $\mathcal{F}_{L}$ becomes $B_{\mathcal{Z}}:=B_{\mathcal{X}}\times B_{\mathcal{Y}}$ and its range is $[0,2]$ . Let us recall some basic facts about the $\ell_{2}$ covering of the $k$ -dimensional unit ball

[TABLE]

an analogous bound holds for $\mathcal{N}(t,B_{\mathcal{Y}},\left\|\cdot\right\|)$ . Now if $\mathcal{C}_{\mathcal{X}}$ is a collection of balls, each of diameter at most $t$ , that covers $B_{\mathcal{X}}$ and $\mathcal{C}_{\mathcal{Y}}$ is a similar collection covering $B_{\mathcal{Y}}$ , then clearly the collection of sets

[TABLE]

covers $B_{\mathcal{Z}}$ . Moreover, each $E\in\mathcal{C}_{\mathcal{Z}}$ is a ball of diameter at most $2t$ in $(\mathcal{Z},\left\|\cdot\right\|_{\mathcal{Z}})$ . It follows that

[TABLE]

Finally, we endow $\mathcal{F}_{L}$ with the $\ell_{\infty}$ norm, and use a Kolmogorov-Tihomirov type covering estimate (see, e.g., Gottlieb et al. (2016, Lemma 5.2)):

[TABLE]

We can now use Gottlieb et al. (2016, Theorem 4.3):

Theorem C.1.

Let $\mathcal{F}_{L}$ be the collection of $L$-Lipschitz $[0,2]$-valued functions defined on a metric space $(\mathcal{Z},\|\cdot\|_{\mathcal{Z}})$ with diameter $1$ and doubling dimension $d$. Then $\hat{R}_{n}(\mathcal{F}_{L};\mathcal{Z})=O\big(\frac{L}{n^{1/(d+1)}}\big)$.

Putting $d=k+\ell$ yields our generalization bound:

[TABLE]

Appendix D Additional experiments.

For completeness we add here the comparison of the results from the experiment for $f(x)=\sin(x)$ for $x\in[-2\pi,2\pi]$ .

Bibliography

Alman J, Williams VV (2021) A refined laser method and faster matrix multiplication. In: Proceedings of the Symposium on Discrete Algorithms, SIAM, pp 522–539
Arora S, Hazan E, Kale S (2012) The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing 8(1):121–164
Arya S, Mount DM, Netanyahu N, Silverman R, Wu AY (1994) An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. In: Symposium on Discrete Algorithms, pp 573–582
Borchani H, Varando G, Bielza C, Larrañaga P (2015) A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5(5):216–233
Boyd S, Vandenberghe L (2004) Convex Optimization. Information Science and Statistics, Cambridge University Press
Brudnak M (2006) Vector-valued support vector regression. In: The 2006 IEEE International Joint Conference on Neural Network Proceedings, IEEE, pp 1562–1569
Bunch JR, Hopcroft JE (1974) Triangular factorization and inversion by fast matrix multiplication. Mathematics of Computation 28(125):231–236
Caruana R (1997) Multitask learning. Machine Learning 28(1):41–75
