Efficient Kirszbraun Extension with Applications to Regression
Hanan Zaichyk, Armin Biess, Aryeh Kontorovich, Yury Makarychev

TL;DR
This paper presents a novel regression framework between Hilbert spaces using Kirszbraun's extension theorem, offering improved computational efficiency and empirical performance in supervised learning tasks.
Contribution
It introduces the first application of Kirszbraun's extension to supervised learning, with a new MWU algorithm that improves runtime and performance.
Findings
Quadratic runtime improvement over existing methods
Significant empirical performance gains
Effective decomposition into training and prediction stages
Abstract
We introduce a framework for performing regression between two Hilbert spaces. This is done based on Kirszbraun's extension theorem; to the best of our knowledge, this is the first application of this technique to supervised learning. We analyze the statistical and computational aspects of this method. We decompose this task into two stages: training (which corresponds operationally to smoothing/regularization) and prediction (which is achieved via Kirszbraun extension). Both are solved algorithmically via a novel multiplicative weight updates (MWU) scheme, which, for our problem formulation, achieves a quadratic runtime improvement over the state of the art. Our empirical results indicate a dramatic improvement over standard off-the-shelf solvers in our setting.
| training points | 20 | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|---|
| Algorithm | Avg. loss | ||||
| MWU | 247.9405 | 0.3333 | 0.31581 | 0.31854 | 0.36143 |
| IntPt | 4.1e-18 | 46023.7964 | 353691.64 | ||
| training points | 20 | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|---|
| Algorithm | Avg. loss | ||||
| MWU | 2.7505 | 19.9428 | 46.2479 | 212.4875 | 1243.2395 |
| IntPt | 18.1733 | 692.6523 | 4087.6655 | ||
| training points | 20 | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|---|
| Algorithm | Runtime (s) | ||||
| MWU | 0.092 | 0.7051 | 1.6326 | 8.0798 | 45.1497 |
| IntPt | 2.474 | 155.5521 | 766.5433 | ||
| training points | 20 | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|---|
| Algorithm | Avg. loss | ||||
| MWU | 1119.4705 | 0.3333 | 0.37267 | 0.43678 | 0.52797 |
| IntPt | 3065.5698 | 9475.260 | 9475.4864 | ||
| training points | 20 | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|---|
| Algorithm | Avg. loss | ||||
| MWU | 0.054 | 0.0014903 | 0.002826 | 0.0051038 | 0.0089853 |
| IntPt | 1.3367 | 1.6918 | 2.6156 | ||
[Table 6: graphical comparison of MWU and IntPt on the Smoothing and Extension stages; plots not reproduced here.]
| training points | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|
| Algorithm | Avg. loss | |||
| MWU | 5.5e-09 | 1.6639e-08 | 3.7683e-08 | 6.8339e-08 |
| IntPt | 5.3e-16 | 5913495.3623 | ||
| training points | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|
| Algorithm | Avg. loss | |||
| MWU | 5.1933 | 11.3827 | 66.5677 | 335.0659 |
| IntPt | 3508.4048 | 3936.6494 | ||
| training points | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|
| Algorithm | Avg. loss | |||
| MWU | 0.78109 | 1.8081 | 8.1855 | 48.5127 |
| IntPt | 114.5577 | 778.1016 | ||
| training points | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|
| Algorithm | Avg. loss | |||
| MWU | 5.5e-09 | 1.09e-08 | 3.3e-08 | 7.3e-08 |
| IntPt | 0.2877 | 0.83248 | ||
| training points | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|
| Algorithm | Avg. loss | |||
| MWU | 0.0026 | 0.0038 | 0.0061 | 0.0095 |
| IntPt | 4.3164 | 1.824 | ||
Efficient Kirszbraun Extension with Applications to Regression
Hanan Zaichyk
Ben-Gurion University
Armin Biess
Ben-Gurion University
Aryeh Kontorovich
Ben-Gurion University
Yury Makarychev
Toyota Technological Institute at Chicago
1 Introduction.
Regression.
The classical problem of estimating a continuous-valued function from noisy observations, known as regression, is of central importance in statistical theory with a broad range of applications; see, for example, Györfi et al. (2006); Nadaraya (1989). When the target function is assumed to have a specific structure, the regression problem is termed parametric and the optimization problem is finite-dimensional. Linear regression (Mohri et al., 2012, chapter 10.3.1) is perhaps the simplest and most common type of parametric regression. When no structural assumptions concerning the target function are made, the regression problem is nonparametric. Informally, the main objective in the study of nonparametric regression is to understand the relationship between the regularity conditions that a function class might satisfy (e.g., Lipschitz or Hölder continuity, or sparsity in some representation) and its behavior vis-à-vis optimization and generalization. Most existing algorithms for regression either focus on the scalar-valued case or else reduce multiple outputs to several scalar problems (Borchani et al., 2015); see Related Work.
Convex optimization.
Many learning problems can be cast in the framework of convex optimization. In particular, regression naturally lends itself to this formulation. While some cases, such as linear regression, admit efficient closed-form solutions, this is not the case in general. Typically, convex optimization problems are solved via iterative methods up to a specified accuracy. One general approach is the family of interior-point methods, which, on problems with variables and constraints achieves a runtime of , where is the cost of evaluating the first and second derivatives of the objective and the constraints (Boyd and Vandenberghe, 2004).
Motivation and contribution.
The chief motivation of this work was to generalize the results of Gottlieb et al. (2017), who provided efficient nonparametric regression methods in the scalar-output case. Attempts to numerically solve our optimization problem, which is naturally formulated as a Quadratically Constrained Quadratic Program (QCQP), via state-of-the-art off-the-shelf solvers indicated that these are incapable of handling our framework, even for relatively small data sets and dimensions. This limitation of QCQP solvers motivated us to develop a specialized algorithm to solve the optimization problem entailed by our regression setting.
In Section 4, we show that our specialized algorithm dramatically outperforms general-purpose QCQP solvers; this algorithm, its theoretical analysis, and MATLAB code (available at https://github.com/HananZaichyk/Kirszbraun-extension) are the main contributions of this paper. We introduce a framework for performing regression between two Hilbert spaces. This is done based on Kirszbraun’s extension theorem; to the best of our knowledge, this is the first application of this technique to supervised learning. This method directly exploits the metrics of the input and output spaces, which makes it explicitly sensitive to the interaction among the output components. Although our main contributions are algorithmic, a statistical analysis of our regression technique is provided in Appendix C.
We formulate the regression problem in two stages: smoothing and extension, which are formally described in Section 3. Roughly speaking, on a dataset of size with input and output dimensions, we formulate the smoothing problem as a Quadratically Constrained Quadratic Program (QCQP) problem with variables and constraints. The extension problem is also formulated as a QCQP with variables and constraints.
Although general QCQP problems are not convex, our special instance is, and as such is, in principle, amenable to the standard convex optimization framework, such as interior point methods. When solving large-scale problems, even a modest improvement in the exponent yields dramatic runtime savings. We propose a Multiplicative Weight Update (MWU) scheme to solve the smoothing problem, to a constant precision, in runtime and the extension problem in runtime .
Related work.
Previous approaches to vector-valued regression include $\varepsilon$-insensitive SVM with -norm regularization (Brudnak, 2006), least-squares and MLE-based methods (Jain and Tewari, 2015), and (for linear models) the Dantzig selector (Chen and Banerjee, 2018). According to a recent survey (Borchani et al., 2015), existing methods essentially “transform the multi-output problem into independent single-output problems.” Some approaches to multitask learning problems (Caruana, 1997) exploit relations between the different tasks. In econometrics, this decoupling of the outputs is made explicit in the Seemingly Unrelated Regressions (SUR) model (Davidson et al., 1993; Greene, 2003, 2012). These approaches, however, do not seem to capture the need for a single vector output with possibly strong relations between its coordinates. We devise a principled approach for leveraging these dependencies via Kirszbraun extension. The latter has previously been applied by Mahabadi et al. (2018) to dimensionality reduction (unsupervised learning), but to the best of our knowledge has not been used in the supervised learning setting.
Both of our problems (smoothing and extension) may be formulated as QCQP programs, whose most general form is
[TABLE]
where , and are vectors, are matrices, and the are scalars. The general problem is NP-hard, but when all of the are positive semidefinite, the problem is convex and can be solved in polynomial time (Boyd and Vandenberghe, 2004). The QCQP is usually solved in practice using log-barrier or primal-dual interior-point methods. The running time of an optimization algorithm based on interior-point methods depends significantly on the problem at hand. Specifically, consider a problem with variables and constraints. In order to obtain a -approximate solution, the algorithm has to perform iterations in the worst case (Nesterov and Nemirovskii, 1994, Chapter 6). In each iteration, the algorithm has to initialize and invert an Hessian matrix (or equivalently solve a system of linear equations with variables). The time required to initialize the Hessian matrix is problem-specific: while it is in the worst case, it is often significantly less than that. The Hessian matrix can be inverted in time, where is the matrix multiplication exponent (Bunch and Hopcroft, 1974) (the best current upper bound on is (Alman and Williams, 2021)). However, to the best of our knowledge, all implementations used in practice perform this step in time. That said, this step can be significantly sped up if the Hessian matrix has a special structure.
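For orientation, the standard textbook form of a QCQP (as in Boyd and Vandenberghe, 2004) can be written as follows. The symbols $x$, $P_i$, $q_i$, $r_i$, $n$, and $m$ are illustrative labels chosen to match the description above; the exact notation of the original display may differ.

```latex
\begin{align*}
\text{minimize}\quad   & \tfrac{1}{2}\, x^{\top} P_0\, x + q_0^{\top} x \\
\text{subject to}\quad & \tfrac{1}{2}\, x^{\top} P_i\, x + q_i^{\top} x + r_i \le 0,
                         \qquad i = 1, \dots, m,
\end{align*}
```

Here $x, q_i \in \mathbb{R}^n$, each $P_i$ is a symmetric $n \times n$ matrix, and each $r_i$ is a scalar. When every $P_i$ is positive semidefinite, both the objective and the feasible set are convex.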
Our Multiplicative Weight Update (MWU) scheme is based on the framework of Arora et al. (2012). We include the relevant background and results in the Appendix for completeness.
Main results.
We cast the general regression problem between Hilbert spaces as two QCQP programs, and provide an efficient algorithm for each problem.
The problem setup, formalized in Section 2, involves a dataset of size of vectors in an -dimensional Euclidean space labeled by -dimensional vectors. The smoothing (also: training, regularization, denoising) problem (Section 3.2) is to perturb the labels so as to achieve the user-specified Lipschitz smoothness constraint while incurring a minimum distortion. This is a standard statistical technique, known as regularization, which prevents overfitting in prediction. Our Theorem 3.5 solves the smoothing optimization problem, up to a tolerance , in runtime .
Next, we address the task of prediction (i.e., assigning a label to a test point). In Theorem 3.1, we accomplish this via -approximate Kirszbraun extension of the smoothed dataset, in runtime . For small , an improvement is possible: a data structure can be constructed off-line at a one-time runtime cost of that allows answering (multiple) future prediction queries in time
[TABLE]
In Section 4, we compare the performance of our MWU-based approach to a state-of-the-art interior-point-based solver and report a significant runtime advantage, which allows processing larger samples and ultimately yields greater accuracy.
Finally, for completeness, in Appendix C, we include a Rademacher-based analysis of the generalization error of our regression algorithm.
2 Formal setup.
Metric space.
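A metric space $(X, d_X)$ is a set $X$ equipped with a symmetric function $d_X: X \times X \to [0, \infty)$ satisfying $d_X(x, x') = 0 \iff x = x'$ and the triangle inequality. Given two metric spaces $(X, d_X)$ and $(Y, d_Y)$, a function $f: X \to Y$ is $L$-Lipschitz if $d_Y(f(x), f(x')) \le L\, d_X(x, x')$ for all $x, x' \in X$; its Lipschitz constant $\|f\|_{\mathrm{Lip}}$ is the smallest $L$ for which the latter inequality holds. For any metric space $(X, d_X)$ and $A \subseteq X$, the following classic Lipschitz extension result, essentially due to McShane (1934); Whitney (1934), holds. If $f: A \to \mathbb{R}$ is Lipschitz (under the inherited metric) then there is an extension $f^*: X \to \mathbb{R}$ that coincides with $f$ on $A$ and $\|f^*\|_{\mathrm{Lip}} = \|f\|_{\mathrm{Lip}}$. A Hilbert space $H$ is a vector space (in our case, over $\mathbb{R}$) equipped with an inner product $\langle \cdot, \cdot \rangle$, which is a positive-definite symmetric bilinear form; further, $H$ is complete in the metric $d_H(x, x') = \sqrt{\langle x - x', x - x' \rangle}$.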
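The scalar-valued extension result of McShane and Whitney has a simple explicit form, which the following minimal Python sketch illustrates on a toy sample. The function names and the data are hypothetical and serve only as an illustration; the sketch uses the classical upper extension, the minimum over sample points of the sample value plus the Lipschitz constant times the distance.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def lipschitz_constant(points, values, dist=euclidean):
    """Smallest L with |f(x) - f(x')| <= L * d(x, x') over the sample."""
    best = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = dist(points[i], points[j])
            if d > 0:
                best = max(best, abs(values[i] - values[j]) / d)
    return best

def mcshane_extend(points, values, L, x, dist=euclidean):
    """Upper McShane-Whitney extension: min_i (f(x_i) + L * d(x, x_i))."""
    return min(v + L * dist(x, p) for p, v in zip(points, values))

# Toy sample (hypothetical): three points on the real line.
points = [(0.0,), (1.0,), (3.0,)]
values = [0.0, 1.0, 0.0]
L = lipschitz_constant(points, values)
at_sample = mcshane_extend(points, values, L, (1.0,))  # agrees with the sample value
at_new = mcshane_extend(points, values, L, (2.0,))     # value at an unseen point
```

This construction works for real-valued functions on any metric space; it does not carry over coordinate-wise to vector-valued outputs without inflating the Lipschitz constant, which is where the Hilbert-space result discussed next comes in.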
Kirszbraun theorem.
Kirszbraun (1934) proved that for two Hilbert spaces $(X, d_X)$ and $(Y, d_Y)$, and $f$ mapping $A \subseteq X$ to $Y$, there is an extension $f^*: X \to Y$ such that $\|f^*\|_{\mathrm{Lip}} = \|f\|_{\mathrm{Lip}}$. This result is in general false for Banach spaces whose norm is not induced by an inner product (Naor, 2015).
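One way to read the theorem, stated here in illustrative notation (a finite domain $\{x_1, \dots, x_n\}$ with images $y_1, \dots, y_n$ and Lipschitz constant $L$; these labels are assumptions, not fixed by the text above): extending the map to a single new point $x$ without increasing the Lipschitz constant amounts to finding a point in an intersection of closed balls.

```latex
\[
\exists\, y \in Y:\ \|y - y_i\|_Y \le L\,\|x - x_i\|_X \ \ \text{for all } i \in [n]
\quad\Longleftrightarrow\quad
\bigcap_{i=1}^{n} B_Y\big(y_i,\; L\,\|x - x_i\|_X\big) \neq \emptyset .
\]
```

Kirszbraun's theorem guarantees that this intersection is non-empty whenever $X$ and $Y$ are Hilbert spaces and $\|y_i - y_j\|_Y \le L\,\|x_i - x_j\|_X$ holds for all pairs $i, j$.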
Learning problem.
We assume a familiarity with the abstract agnostic learning framework and refer the reader to Mohri et al. (2012) for background. Our approach will be applied to learn a mapping between two Hilbert spaces, $(X, d_X)$ and $(Y, d_Y)$. We assume a fixed unknown distribution $P$ on $X \times Y$ and a labeled sample $(x_i, y_i)_{i \in [n]}$ of input-output examples. The risk of a given mapping $f: X \to Y$ is defined as $R(f) = \mathbb{E}_{(x, y) \sim P}\, d_Y(f(x), y)$; implicit here is our designation of the metric of $Y$ as the loss function. Analogously, the empirical risk of $f$ on a labeled sample is given by $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} d_Y(f(x_i), y_i)$. In this paper, we always take $X = \mathbb{R}^a$ and $Y = \mathbb{R}^b$, each equipped with the standard Euclidean metric. Uniform deviation bounds on $|R(f) - \hat{R}_n(f)|$, over all $f$ with $\|f\|_{\mathrm{Lip}} \le L$, are given in Section C.
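As a small illustration of the empirical risk, the sketch below averages the Euclidean distance between predictions and labels over a sample. This assumes the loss is the (unsquared) metric of the output space, as the paragraph above indicates; the function names and the toy data are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def empirical_risk(f, sample):
    """Average output-space distance between f(x) and y over the sample."""
    return sum(euclidean(f(x), y) for x, y in sample) / len(sample)

# Toy data (hypothetical): inputs in R^1, labels in R^2.
sample = [((0.0,), (0.0, 0.0)), ((1.0,), (3.0, 4.0))]

def zero_map(x):
    """A trivial predictor that always outputs the origin of R^2."""
    return (0.0, 0.0)

risk = empirical_risk(zero_map, sample)  # (0 + 5) / 2
```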
3 Learning algorithm
Overview.
We follow the basic strategy proposed by Gottlieb et al. (2017) for real-valued regression. We are given a labeled sample , where and . For a user-specified Lipschitz constant , we compute the (approximate) Empirical Risk Minimizer (ERM) over . (A standard method for tuning is via Structural Risk Minimization (SRM): One computes a generalization bound , where , as derived in Section C, and chooses to minimize this. We omit this standard stage of the learning process.)
Predicting the value at a test point amounts to Lipschitz-extending from to . Equivalently, the ERM stage may be viewed as a smoothing procedure, where and is the smoothed sample — which is then (approximately) Lipschitz-extended to . We proceed to describe each stage in detail.
3.1 Approximate Lipschitz extension
Problem statement.
Given a finite sequence , its image under some -Lipschitz map , a test point , and a precision parameter , we wish to compute so that for all . Our first result is an efficient algorithm for achieving this:
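The acceptance condition for a candidate label can be checked directly. The sketch below assumes the condition has the form "distance from the candidate to each sample label is at most (1 + eps) times L times the distance from the test point to the corresponding sample input"; this form, the function name, and the toy data are assumptions made for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_approx_extension(xs, ys, x, y, L, eps):
    """True if ||y - y_i|| <= (1 + eps) * L * ||x - x_i|| for every i."""
    return all(
        euclidean(y, yi) <= (1.0 + eps) * L * euclidean(x, xi)
        for xi, yi in zip(xs, ys)
    )

# Toy instance (hypothetical): two sample points in R^2 with labels in R^1.
xs = [(0.0, 0.0), (2.0, 0.0)]
ys = [(0.0,), (2.0,)]
ok = is_approx_extension(xs, ys, x=(1.0, 0.0), y=(1.0,), L=1.0, eps=0.1)
bad = is_approx_extension(xs, ys, x=(1.0, 0.0), y=(2.5,), L=1.0, eps=0.1)
```

A check like this costs one pass over the sample and is useful for validating the output of any extension routine, independent of how the candidate was computed.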
Theorem 3.1.
The approximate Lipschitz extension algorithm OnePointExtension has runtime .
The query runtime can be significantly improved if the dimension of is moderate:
Theorem 3.2.
There is a data structure for the Lipschitz extension problem of memory size that can be constructed in time . Given a query point and a parameter , one can compute such that for every in time .
Analysis.
We analyze Algorithm 1 (OnePointExtension) and prove Theorems 3.1 and 3.2 via the multiplicative update framework of Arora et al. (2012). In particular, we will invoke their Theorem 3.4, which, for completeness, is reproduced in Appendix A as Theorem A.1. To simplify the notation, we assume (without loss of generality) that . Let , and define for . Then the Lipschitz extension problem is equivalent to the following: find such that for all . Note that the functions are concave and thus the problem is in the form of (3.8) from Arora et al. (2012). We now bound the “width” of the problem, proving that for every (in the notation from Arora et al. (2012), we show that and ). Observe that for every and every , we have (i) as and (ii)
[TABLE]
Here, we used that (which is true since ), (which is true since is 1-Lipschitz), and (which is true since is the point closest to among all points ). We conclude that .
To apply Theorem A.1, we design an oracle for the following problem:
Problem 3.3.
Given non-negative weights that add up to , find such that
[TABLE]
Note that Problem 3.3 has a solution, since , the Lipschitz extension of to (whose existence is guaranteed by the Kirszbraun theorem), satisfies (1). Define auxiliary weights and as follows:
[TABLE]
[TABLE]
The oracle finds and outputs that minimizes . To this end, it first computes . Note that Then, if , it sets ; otherwise, is set to be the point closest to in , which is
[TABLE]
This is computed on lines 6–8 of the algorithm. We verify that satisfies condition (1). Rewrite condition (1) in terms of weights : . Using that
[TABLE]
we get
[TABLE]
The first inequality is due to Cauchy–Schwarz, and the second holds since .
Proof of Theorem 3.1. From Theorem A.1, we get that the algorithm finds a approximate solution in iterations. Computing distances takes time, and each iteration takes time. ∎
Proof of Theorem 3.2 (sketch). Our key observation is that we can run the algorithm from Theorem 3.1 on a subset of , which is sufficiently dense in . Specifically, let be a -approximate nearest neighbor for in . Assume that a subset contains and satisfies the following property: for every , there exists such that .
First, we will prove that by running the algorithm on set we get such that for all . Then we describe a data structure that we use to find for a given query point in time .
(1) Algorithm from Theorem 3.1 finds such that for all . Consider . First, assume that . Find such that . Then
[TABLE]
as required. Now assume that .
[TABLE]
We use a data structure for approximate nearest neighbor search in . We employ one of the constructions for low-dimensional Euclidean spaces, by either of Arya et al. (1994) or Har-Peled and Mendel (2006). Using , we can find a -approximate nearest neighbor of a point in in time . Recall that we can construct in time, and it requires space. Suppose that we get a query point . We first find an approximate nearest neighbor for . Let . Take an net in the ball . For every point , we find an approximate nearest neighbor in (using ). Let . Consider . There is at distance at most from . Let . Then
[TABLE]
and
[TABLE]
as required. The size of is at most the size of , which is . ∎
Multi-point Lipschitz extension.
Finally, we describe an algorithm for the Multi-point Lipschitz Extension. The problem is a generalization of the problem we studied in Section 3.1. We are given a set of points and their images under -Lipschitz map . Additionally, we are given a set and a set of edges on . We need to extend to (that is, find ) such that for . We note that may contain edges that impose Lipschitz constraints (i) between points in and and (ii) between pairs of points in . Without loss of generality, we assume that there are no edges with .
Theorem 3.4.
There is an algorithm for the Multi-point Lipschitz Extension problem that runs in time
[TABLE]
where .
The algorithm and its analysis are almost identical to those for the Lipschitz Smoothing problem (see Theorem 3.5).
3.2 Smoothing
Problem statement.
We reformulate the ERM problem as follows. Given two sets of vectors, , where and , we wish to compute a “smoothed” version of the ’s so as to
[TABLE]
is the distortion, and for all are the Lipschitz constraints. Here, and (the columns of matrices and are vectors and , respectively). Notice that when we use the norm, this problem is a quadratically constrained quadratic program (QCQP).
We consider a more general variant of this problem where we are given a set of edges on , and the goal is to ensure that the Lipschitz constraints hold (only) for . The original problem corresponds to the case when is the complete graph, (). Importantly, if the doubling dimension is low, we can solve the original problem by letting be a -stretch spanner; then (this approach was previously used by Gottlieb et al. (2017); see also Har-Peled and Mendel (2006, Section 8.2), who used a similar approach to compute the doubling constant). Our algorithm for Lipschitz Smoothing iteratively solves Laplace’s problem in the graph . We proceed to define this problem and present a closed-form formula for the solution.
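To make the two ingredients of the smoothing problem concrete, the following sketch evaluates a candidate set of smoothed labels: it computes the distortion and lists the edges whose Lipschitz constraint is violated. The distortion is assumed here to be the sum of squared Euclidean distances between original and smoothed labels; that choice, the function names, and the toy data are assumptions for illustration only.

```python
import math
from itertools import combinations

def euclidean(u, v):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distortion(ys, ys_smooth):
    """Sum of squared distances between labels and smoothed labels (assumed form)."""
    return sum(euclidean(y, z) ** 2 for y, z in zip(ys, ys_smooth))

def violated_edges(xs, ys_smooth, L, edges=None):
    """Edges (i, j) with ||z_i - z_j|| > L * ||x_i - x_j||."""
    if edges is None:
        # Default to the complete graph on the sample.
        edges = list(combinations(range(len(xs)), 2))
    return [
        (i, j) for i, j in edges
        if euclidean(ys_smooth[i], ys_smooth[j]) > L * euclidean(xs[i], xs[j])
    ]

# Toy instance (hypothetical): one noisy label in the middle.
xs = [(0.0,), (1.0,), (2.0,)]
ys = [(0.0,), (3.0,), (2.0,)]
smoothed = [(0.0,), (1.0,), (2.0,)]

before = violated_edges(xs, ys, L=1.0)
after = violated_edges(xs, smoothed, L=1.0)
cost = distortion(ys, smoothed)
```

Passing an explicit `edges` list restricts the check to a sparse graph such as a spanner, mirroring the generalized variant described above.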
Laplace’s problem.
We are given vectors , graph , and additionally vertex weights (for ) and edge weights (for ), find so as to
[TABLE]
Let be the Laplacian of with edge weights ; that is and for . Let . Then
[TABLE]
This equation can be solved separately for each of rows of using a nearly-linear-time equation solver for diagonally dominant matrices by Koutis et al. (2012) in total time (see also the paper by Spielman and Teng (2004), which presented the first nearly-linear-time solver for diagonally dominant matrices).
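The structure of this linear system can be sketched for scalar labels as follows. The objective is an assumption reconstructed from the description above: a vertex-weighted sum of squared deviations from the given labels plus an edge-weighted sum of squared differences across edges. Setting its gradient to zero gives a system whose matrix is the diagonal matrix of vertex weights plus the weighted graph Laplacian. Dense Gaussian elimination stands in for the nearly-linear-time solver, and all names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def laplace_smooth(y, vertex_w, edge_w):
    """Minimize sum_i lam_i (z_i - y_i)^2 + sum_{(i,j)} mu_ij (z_i - z_j)^2.

    Assumed objective; the minimizer solves (diag(lam) + Laplacian) z = diag(lam) y.
    """
    n = len(y)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = vertex_w[i]
    for (i, j), mu in edge_w.items():
        # Add the weighted Laplacian contribution of edge (i, j).
        A[i][i] += mu
        A[j][j] += mu
        A[i][j] -= mu
        A[j][i] -= mu
    rhs = [vertex_w[i] * y[i] for i in range(n)]
    return solve_linear(A, rhs)

# Toy instance (hypothetical): two vertices joined by one edge.
y = [0.0, 3.0]
z = laplace_smooth(y, vertex_w=[1.0, 1.0], edge_w={(0, 1): 1.0})
```

For vector-valued labels the same system is solved once per output coordinate, since the matrix does not depend on the coordinate.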
We solve the Lipschitz Smoothing problem via the multiplicative weight update algorithm LipschitzSmooth, presented below. It was inspired by the algorithm for finding maximum flow using electrical networks by Christiano et al. (2011).
Analysis.
Let be the optimal solution to the Lipschitz Smoothing problem and and be a approximation to the optimal value; that is,
[TABLE]
(we assume that is given to the algorithm; note that can be found by binary search).
As in Section 3.1, we use the multiplicative-weight update (MWU) method. Let
[TABLE]
Note that functions and are concave.
Observe that and for every . On the other hand, if and , then
[TABLE]
and for every .
In the Appendix, we describe the approximation oracle that we invoke in the MWU method.
Theorem 3.5.
There is an algorithm for the Lipschitz Smoothing problem that runs in time
[TABLE]
where .
Proof of Theorem 3.5. From Theorem 3.5 in Arora et al. (2012), we get that the algorithm finds an approximate solution in iterations. Each iteration takes time (which is dominated by the time necessary to solve Laplace’s problem); additionally, we spend time to compute pairwise distances between points in . ∎
4 Experiments
To illustrate the utility of our framework, we designed two simple non-linear transformation problems where the input and output are both scalars. Our data were generated uniformly at random over , and we evaluated the performance on two cases: and .
Results.
In order to perform a fair, apples-to-apples comparison, we implemented both Algorithms 3 and 1 in Matlab, which has standard, optimized QCQP solvers, and performed the regression via the Kirszbraun extension technique. We compared the results of this learning method when using our methods for the optimization problems (MWU) versus Matlab’s QCQP solver based on the interior-point algorithm (IntPt). We used the squared Euclidean distance as the loss function. We ran several tests with training sets of 20, 100, 200, 500, and 1000 random points, and 100 test points in all experiments. For reproducibility, we used Matlab’s random seed 1 in all our runs. All the tests were conducted on the same MacBook Pro computer. The numeric comparison (Tables 1-5) shows the clear superiority of MWU over the IntPt method in both efficiency and learning quality. The MWU method is able to optimize a data set of several thousand data points, while the IntPt-based method could not complete its process in “reasonable” time (an overnight run of over 10 hours) with more than training points. In terms of solving the learning problem, MWU is able to solve the QCQP problem and produces accurate smoothing and increasingly accurate extension functions as the data size grows. The IntPt method, on the other hand, is able to meet all the constraints of the problem only with a very small data set (fewer than 50 training points), which is insufficient data for learning. Tables 1-5 show that for 20 training points, IntPt is able to train in 2.474 seconds and completely overfits the data set with 0 ERM, which leads to the expected very poor generalization due to the size of the data (Table 5). On larger datasets, the IntPt optimization fails to solve the optimization problems correctly with respect to all of the constraints. This results in several “heavy” outliers, which heavily affect the average squared error of both the smoothing and extension phases, as can be seen in Tables 1-5.
Table 6 shows a graphical comparison for both implementations when the training set has points. The “heavy outliers” can be spotted easily on the graph, and explain why the same learning algorithm has such big differences when optimised with two different methods.
Tables 1-6 summarise the results for . The results for show the same basic pattern and appear in the appendix. The blank entries in the tables indicate that the process did not terminate in the time allotted (12 hours).
5 Discussion and Conclusions
This work introduces a framework for performing regression between two Hilbert spaces based on Kirszbraun’s extension theorem, along with a statistical analysis of this method. The task is decomposed into two stages: smoothing (which corresponds to training) and prediction (which is achieved via Kirszbraun extension). Our attempts to solve the resulting optimization problems numerically indicated the need for a more efficient solver than off-the-shelf state-of-the-art solvers. We introduced two optimization algorithms, one for the smoothing problem and one for the extension problem, both based on novel MWU schemes. Both the analysis and the experiments show a dramatic runtime improvement for both optimization problems, indicating that these algorithms are the main contribution of this work and an interesting topic for future research in their own right. Our code is also provided for reproducibility and to facilitate usage.
Acknowledgements.
AK was partially supported by the Israel Science Foundation (grant No. 1602/19), the Ben-Gurion University Data Science Research Center, and an Amazon Research Award. HZ was an MSc student at Ben-Gurion University of the Negev during part of this research.
Appendix A The Arora-Hazan-Kale result
For completeness, we quote here verbatim (except for the numbering) the relevant definitions and results from (Arora et al., 2012, Sec. 3.3.1, p. 137).
Imagine that we have the following feasibility problem:
$$\exists?\, x \in \mathcal{P}:\quad \forall i \in [m]:\ f_i(x) \ge 0, \tag{2}$$
where $\mathcal{P} \subseteq \mathbb{R}^n$ is a convex domain, and for $i \in [m]$, $f_i: \mathcal{P} \to \mathbb{R}$ are concave functions. We wish to satisfy this system approximately, up to an additive error of $\varepsilon$. We assume the existence of an Oracle, which, when given a probability distribution $p = (p_1, \dots, p_m)$, solves the following feasibility problem:
$$\exists?\, x \in \mathcal{P}:\quad \sum_{i} p_i f_i(x) \ge 0. \tag{3}$$
An Oracle is called $(\ell, \rho)$-bounded if there is a fixed subset of constraints $I \subseteq [m]$ such that whenever it returns a feasible solution $x$ to (3), all constraints $i \in I$ take values in the range $[-\ell, \rho]$ on the point $x$, and all the rest take values in $[-\rho, \ell]$.
Theorem A.1 (Theorem 3.4 in Arora et al. (2012)).
Let $\varepsilon > 0$ be a given error parameter. Suppose there exists an $(\ell, \rho)$-bounded Oracle for the feasibility problem (2). Assume that $\ell \ge \varepsilon/2$. Then there is an algorithm which either solves the problem up to an additive error of $\varepsilon$, or correctly concludes that the system is infeasible, making only $O(\ell \rho \log(m)/\varepsilon^2)$ calls to the Oracle, with an additional processing time of $O(m)$ per call.
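The oracle-based scheme can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm analyzed in the paper: it uses the basic symmetric-width variant (every constraint value lies in [-rho, rho] on oracle outputs) rather than the refined bounded-oracle version, the unknown is a single real number, and the toy constraints and oracle are hypothetical.

```python
import math

def mwu_feasibility(constraints, oracle, rho, eps):
    """Find x with f_i(x) >= -eps for all i, or return None if the oracle fails.

    constraints: concave functions f_i with |f_i(x)| <= rho on oracle outputs.
    oracle(p): returns x with sum_i p_i * f_i(x) >= 0, or None if no such x exists.
    """
    m = len(constraints)
    eta = eps / (2.0 * rho)  # step size; at most 1/2 when eps <= rho
    rounds = math.ceil(4.0 * rho ** 2 * math.log(m) / eps ** 2)
    p = [1.0 / m] * m
    total = 0.0
    for _ in range(rounds):
        x = oracle(p)
        if x is None:
            return None  # the weighted system is infeasible, hence so is the original
        total += x
        # Satisfied constraints (f_i > 0) lose weight; violated ones gain weight.
        w = [pi * (1.0 - eta * f(x) / rho) for pi, f in zip(p, constraints)]
        s = sum(w)
        p = [wi / s for wi in w]
    # By concavity, the average iterate inherits the averaged guarantee.
    return total / rounds

# Toy instance (hypothetical): find x in [-1, 1] with 0.2 <= x <= 0.6.
constraints = [lambda x: x - 0.2, lambda x: 0.6 - x]

def oracle(p):
    """Maximize the p-weighted linear combination over the interval [-1, 1]."""
    x = 1.0 if p[0] >= p[1] else -1.0
    value = sum(pi * f(x) for pi, f in zip(p, constraints))
    return x if value >= 0 else None

eps = 0.1
x_bar = mwu_feasibility(constraints, oracle, rho=1.6, eps=eps)
```

Note that each oracle answer is an endpoint of the interval and violates one of the two constraints; only the average over rounds is approximately feasible, which is the characteristic behavior of this family of methods.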
Appendix B Approximate oracle
To use the MWU method (see Theorem 3.5 in Arora et al. (2012)), we design an approximate oracle for the following problem.
Problem B.1.
Given non-negative edge weights and , which add up to 1, find such that
[TABLE]
Let and . We solve Laplace’s problem with parameters and (see Section 3.2 and Line 9 of the algorithm). We get a matrix minimizing
[TABLE]
Consider the optimal solution for Lipschitz Smoothing. We have
[TABLE]
We verify that is a feasible solution for Problem B.1. We have
[TABLE]
as required.
Finally, we bound the width of the problem. We have and . Then, using (B), we get
[TABLE]
Therefore, .
Similarly,
[TABLE]
Therefore, .
Appendix C Generalization bounds
Let and be the unit balls of their respective Hilbert spaces (each endowed with the norm and corresponding inner product) and be the set of all -Lipschitz mappings from to . In particular, every satisfies
[TABLE]
Let be the loss class associated with :
[TABLE]
In particular, every satisfies .
Our goal is to bound the Rademacher complexity of . We do this via a covering numbers approach.
The empirical Rademacher complexity of a collection of functions mapping some set to is defined by:
[TABLE]
Recall the relevance of Rademacher complexities to uniform deviation estimates for the risk functional (Mohri et al., 2012, Theorem 3.1): for every , with probability at least , for each :
[TABLE]
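For intuition about the quantity being bounded, the empirical Rademacher complexity of a small finite class can be computed exactly by enumerating all sign vectors. The sketch below assumes the standard definition (the expectation, over uniform random signs, of the largest sign-weighted sample average within the class), as in Mohri et al. (2012); the function name and the toy classes are hypothetical.

```python
from itertools import product

def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite function class.

    function_values: one tuple per function, holding its values on the n
    sample points. Enumerates all 2^n sign vectors, so only suitable for tiny n.
    """
    n = len(function_values[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += max(
            sum(s * v for s, v in zip(signs, vals)) / n
            for vals in function_values
        )
    return total / 2 ** n

# A single function cannot correlate with random signs on average.
single = empirical_rademacher([(1.0, 1.0)])
# Adding its negation lets the class match the sign pattern half the time.
symmetric = empirical_rademacher([(1.0, 1.0), (-1.0, -1.0)])
```

Richer classes can fit random signs better and therefore have larger complexity, which is why the bound depends on the Lipschitz constant of the class.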
Define and endow it with the norm ; note that is a Banach but not a Hilbert space. First, we observe that the functions in are Lipschitz under . Indeed, choose any and , . Then
[TABLE]
where . We conclude that any is -Lipschitz under .
Since we restricted the domain and range of , respectively, to the unit balls and , the domain of becomes and its range is . Let us recall some basic facts about the covering of the -dimensional unit ball
[TABLE]
an analogous bound holds for . Now if is a collection of balls, each of diameter at most , that covers and is a similar collection covering , then clearly the collection of sets
[TABLE]
covers . Moreover, each is a ball of diameter at most in . It follows that
[TABLE]
Finally, we endow with the norm, and use a Kolmogorov-Tihomirov type covering estimate (see, e.g., Gottlieb et al. (2016, Lemma 5.2)):
[TABLE]
We can now use Gottlieb et al. (2016, Theorem 4.3):
Theorem C.1.
Let $F_L$ be the collection of $L$-Lipschitz -valued functions defined on a metric space with diameter and doubling dimension $d$. Then $\hat{R}_{n}(F_{L};\mathcal{Z})=O\big(\frac{L}{n^{1/(d+1)}}\big)$.
Putting yields our generalization bound:
[TABLE]
Appendix D Additional experiments.
For completeness we add here the comparison of the results from the experiment for for .
References
- Alman J, Williams VV (2021) A refined laser method and faster matrix multiplication. In: Proceedings of the Symposium on Discrete Algorithms, SIAM, pp 522–539
- Arora S, Hazan E, Kale S (2012) The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing 8(1):121–164
- Arya S, Mount DM, Netanyahu N, Silverman R, Wu AY (1994) An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. In: Symposium on Discrete Algorithms, pp 573–582
- Borchani H, Varando G, Bielza C, Larrañaga P (2015) A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5(5):216–233
- Boyd S, Vandenberghe L (2004) Convex Optimization. Information Science and Statistics, Cambridge University Press
- Brudnak M (2006) Vector-valued support vector regression. In: The 2006 IEEE International Joint Conference on Neural Network Proceedings, IEEE, pp 1562–1569
- Bunch JR, Hopcroft JE (1974) Triangular factorization and inversion by fast matrix multiplication. Mathematics of Computation 28(125):231–236
- Caruana R (1997) Multitask learning. Machine Learning 28(1):41–75
