Vector-valued Reproducing Kernel Banach Spaces with Group Lasso Norms
Liangzhi Chen, Haizhang Zhang, Jun Zhang

TL;DR
This paper develops a mathematical framework for vector-valued reproducing kernel Banach spaces with group lasso norms, enabling sparse multi-task learning with theoretical guarantees and new reproducing kernels.
Contribution
It introduces RKBSs with $\ell_{p,1}$-norms supporting the linear representer theorem and proposes admissible reproducing kernels for sparse multi-task learning.
Findings
Established a theoretical foundation for RKBSs with group lasso norms.
Proved the support of the linear representer theorem in this setting.
Designed reproducing kernels suitable for sparse multi-task learning.
Abstract
Focusing on establishing a mathematical basis for kernel methods in sparse multi-task learning, we explore the theory of vector-valued reproducing kernel Banach spaces (RKBSs) endowed with $\ell_{p,1}$-norms, encompassing both the sparse learning case when $p=1$ and the group lasso when $p=2$. We develop RKBSs equipped with these group lasso norms that support the linear representer theorem for regularized learning frameworks. Additionally, we introduce reproducing kernels admissible for this construction. Such reproducing kernels are applicable to sparse multi-task learning with group lasso norms.
Supported by the Natural Science Foundation of China under grants 11571377 and 11222103, and by DARPA/ARO under grant #W911NF-16-1-0383.
Liangzhi Chen
School of Data and Computer Science
Sun Yat-sen University
Guangzhou, China
Haizhang Zhang
School of Data and Computer Science
Sun Yat-sen University
Guangzhou, China
Jun Zhang
Department of Psychology and Department of Mathematics
University of Michigan
Ann Arbor, MI 48109, USA
Abstract
Aiming at a mathematical foundation for kernel methods in coefficient regularization for multi-task learning, we investigate the theory of vector-valued reproducing kernel Banach spaces (RKBSs) with $\ell_{p,1}$-norms, which contains the sparse learning scheme ($p=1$) and the group lasso ($p=2$). We construct RKBSs that are equipped with such group lasso norms and admit the linear representer theorem for regularized learning schemes. The corresponding kernels that are admissible for the construction are discussed.
Index Terms:
vector-valued spaces, reproducing kernel Banach spaces, multi-task learning, the representer theorem
I Introduction
Learning theory focuses on finding well-performing predictors from limited data. Solving such problems, however, often leads to ill-posed problems [37, 27]. Regularization is a widely used method for dealing with this phenomenon. It is formulated as an optimization problem that involves an error term and a regularizer. Consider the following optimization problem
[TABLE]
where is a space of functions on some data set , is a set of input/output data, is a regularization parameter, is an error function and is called the regularizer function.
Classical cases of (1) are regularized by Euclidean norms or, more generally, Hilbertian norms. These have been thoroughly studied in the literature [11, 30, 4]. Learning in reproducing kernel Hilbert spaces (RKHSs) has received considerable attention over the past few decades in machine learning [3, 30, 31], statistical learning [40, 4], and stochastic processes [28]. Several reasons account for the success of learning methods in RKHSs. Firstly, kernels can be used to measure the similarity between input points by means of the “kernel trick”. Secondly, an RKHS is a Hilbert space of functions on for which point evaluations are continuous linear functionals. Sample data available for learning are usually modeled by point evaluations of the unknown target function. Finally, by the Riesz representation theorem, the point evaluation functionals on can be represented by its associated reproducing kernel. These facts lead to the celebrated representer theorem [19, 2], which is desirable for learning in high-dimensional or infinite-dimensional spaces.
However, it is difficult to enhance the performance of learning approaches in an RKHS due to its simple geometrical structure. Recently, learning in scalar-valued RKBSs [24, 46, 38, 43, 32, 35, 48, 34] and in multi-task learning settings [5, 20, 25, 7, 8, 1, 47, 49] has been studied systematically. The work on $\ell^1$-norm RKBSs [34] has attracted much attention. This is because $\ell^1$-norm regularization [36] in single-task learning problems often results in sparse solutions [6, 13, 39], which are desirable in machine learning. Sparsity is essential for extracting relatively low-dimensional features from sample data that usually live in high-dimensional spaces.
Multi-task learning problems appear often in applications. Methods based on single-task learning techniques assume, unnaturally, that tasks are independent of each other, and they tend to perform poorly on small data sets. By contrast, multi-task learning uses correlated information to improve the performance of the whole learning process. Many multi-task learning approaches have been proposed to boost the efficiency of the lasso in coping with such problems, such as the smoothly clipped absolute deviation [15, 41], the adaptive lasso [50], the relaxed lasso [23], the group lasso [45], and the sparse group lasso [16, 33]. Numerical experiments in [9, 14, 23, 25, 16] show that multi-task learning tends to provide better learning results than single-task learning.
The main task of this paper is to develop the learning theory for vector-valued RKBSs with the $\ell_{p,1}$ norms. When $p=1$, this reduces to the $\ell^1$-norm vector-valued RKBS recently studied in [20]. Our approach is more general and includes the important group lasso case $p=2$. Our first objective is to construct an $\ell_{p,1}$-norm vector-valued RKBS based on admissible kernels, and then to derive the representer theorem for regularized learning schemes. These are the main contents of Sections III and IV. Our second objective concerns the admissible kernels. In Section V, we give a family of new admissible kernels, and then discuss kernel functions whose Lebesgue constants are bounded above by 1.
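To make the group lasso norm concrete, the following minimal sketch evaluates a mixed norm of a finite coefficient array. It assumes the $\ell_{p,1}$ convention in which each row is one coefficient vector, each row is measured in the $p$-norm, and the row norms are summed. The function name `mixed_norm`, the row layout, and the restriction to real entries are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def mixed_norm(C, p):
    """Sum over rows of the p-norm of each row (an l_{p,1}-type norm).

    Assumption: row j of C holds the coefficient vector c_j.
    """
    C = np.atleast_2d(np.asarray(C, dtype=float))
    row_norms = np.linalg.norm(C, ord=p, axis=1)  # vector p-norm of every row
    return float(row_norms.sum())

C = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, -1.0]])

l11 = mixed_norm(C, 1)  # p = 1: the entrywise l1 norm used in sparse learning
l21 = mixed_norm(C, 2)  # p = 2: the group lasso norm
```

With $p=1$ every entry is penalized individually, whereas with $p=2$ a whole row is either active or zero, which is the grouping effect that motivates the group lasso.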
Before entering the subject of the paper, we list earlier research on RKBSs:
1. Scalar-valued RKBSs [46, 48] and vector-valued RKBSs [49] built on uniformly convex and uniformly smooth Banach spaces via semi-inner products [22].
2. Scalar-valued RKBSs with the $\ell^1$-norm [34, 35].
3. The -norm scalar-valued RKBSs [44] developed via dual-bilinear forms and the generalized Mercer kernels.
4. Vector-valued RKBSs with the $\ell^1$-norm [20].
5. Generic definitions and a unified framework for the construction of scalar-valued RKBSs [21].
II Preliminaries and Notations
Throughout this paper, always denotes a real number that lies in the extended interval , and is its conjugate number such that (if then ; and if then ). The notation denotes the set of all positive integers and is defined for every . Let and be the sets of complex numbers, real numbers and nonnegative real numbers, respectively. For any Banach space, denote by its zero element.
For a Banach space , denote its dual Banach space by . When , let be the classical countably infinite-dimensional Hilbert space. When , is assumed to be a finite-dimensional complex Euclidean space with the -norm. Note that and as for , has been assumed to be finite-dimensional. Denote the bilinear form on by . Thus, for elements , and .
Let be two Banach spaces, and denote by the space of all bounded linear operators from to . Then is also a Banach space. For any , its operator norm is defined by
[TABLE]
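As a finite-dimensional illustration of an operator norm, the sketch below evaluates the induced matrix norm for $p \in \{1, 2, \infty\}$, where NumPy provides closed forms, and compares it with sampled ratios $\|Tu\|_p / \|u\|_p$. The helper name `induced_norm`, the real-valued random matrix, and the restriction to these three values of $p$ are assumptions made for illustration only.

```python
import numpy as np

def induced_norm(T, p):
    """Operator norm of the matrix T when domain and range carry the same p-norm.

    Only p in {1, 2, inf} is handled, where np.linalg.norm returns the induced norm.
    """
    if p not in (1, 2, np.inf):
        raise ValueError("closed forms are only used here for p in {1, 2, inf}")
    return float(np.linalg.norm(T, ord=p))

rng = np.random.default_rng(0)
T = rng.standard_normal((3, 4))

# Every sampled ratio ||T u||_p / ||u||_p is a lower bound for the supremum.
sampled = {}
for p in (1, 2, np.inf):
    U = rng.standard_normal((4, 200))
    ratios = np.linalg.norm(T @ U, ord=p, axis=0) / np.linalg.norm(U, ord=p, axis=0)
    sampled[p] = float(ratios.max())
```

Sampling only ever approaches the supremum from below, which is why the closed-form value is used as the reference.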
For any nonempty set , we introduce
[TABLE]
Here, the set might be uncountable, but this causes no trouble, as any element in has at most countably many nonzero coordinates.
We denote the set of samplings in an input space by , and the corresponding observations by . For later convenience, we introduce the following notation. Denote by
[TABLE]
an matrix with entries in . Its associated vectors are denoted by
[TABLE]
and
[TABLE]
II-A Reproducing kernel Banach spaces of vector-valued functions
Before giving a formal definition of RKBSs of vector-valued functions, we recall some terminology.
Definition II.1.
[46] A space is called a Banach space of vector-valued functions if the point evaluation functionals are consistent with the norm on in the sense that for all , if and only if for every . A Banach space of vector-valued functions on is said to be a pre-RKBS on if point evaluations are continuous linear functionals on .
To accommodate the main purpose of this paper, we present a slightly different version of RKBSs of vector-valued functions from [49]. Denote a space with the norm by .
Definition II.2.
The Banach spaces and are RKBSs of vector-valued functions from to provided that
- (i) and are pre-RKBSs of vector-valued functions;
- (ii) there exists a kernel function such that
[TABLE]
- (iii) the reproducing properties hold true in the sense that
[TABLE]
for all .
Under these assumptions, is called the reproducing kernel of and .
II-B Admissible kernels
The requirements of a kernel function that can be used to construct a vector-valued RKBS with the -norm are formulated as follows.
Definition II.3 (Admissible Kernels).
A kernel is admissible for the construction of RKBS of vector-valued functions from to endowed with the -norm if the following assumptions are satisfied.
- (A1) For any pairwise distinct sampling points , the matrix
[TABLE]
is invertible in the sense that there exists a , such that
[TABLE]
and
[TABLE]
where is the identity operator on , and is an matrix with diagonal entries and zero operator elsewhere. We simply denote by if no confusion is caused.
- (A2) The kernel is bounded. That is, there exists such that the operator norm
[TABLE]
for all .
- (A3) For any pairwise distinct points and , if for all , then for all .
- (A4) For any pairwise distinct ,
[TABLE]
is bounded above by , where is a linear operator from to .
We denote the corresponding assumptions for the scalar case in [34] by (A1′)–(A4′).
We make some remarks on assumption (A1) in Definition II.3. Note that for , we have and there do exist two linear operators , such that and . If both the linear operators are bounded, then most of the theoretical work in this paper would hold for . Unfortunately, for , there do not exist two bounded linear operators , , such that or . This is the main reason why we have to assume to be a finite-dimensional subspace of .
II-C Further preliminaries on matrix theory
We discuss some useful facts about the operator norm defined in (A4). For an operator matrix and a vector , we have the following compatible inequality for ,
[TABLE]
where denotes the -th column of . When the entries of are scalar-valued, .
Also, the following inversion of a blockwise matrix will be used many times in this paper:
[TABLE]
where .
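Blockwise inversion of this kind can be checked numerically. The sketch below uses the standard Schur complement formula for a $2 \times 2$ block matrix with an invertible leading block. It is written for a generic real matrix, so the block names `A`, `B`, `C`, `D` and the symbol `S` are illustrative assumptions and need not match the paper's notation.

```python
import numpy as np

def block_inverse(A, B, C, D):
    """Invert [[A, B], [C, D]] through the Schur complement S = D - C A^{-1} B.

    Assumes that A and S are both invertible.
    """
    A_inv = np.linalg.inv(A)
    S = D - C @ A_inv @ B
    S_inv = np.linalg.inv(S)
    top_left = A_inv + A_inv @ B @ S_inv @ C @ A_inv
    top_right = -A_inv @ B @ S_inv
    bottom_left = -S_inv @ C @ A_inv
    return np.block([[top_left, top_right], [bottom_left, S_inv]])

rng = np.random.default_rng(1)
G = rng.standard_normal((5, 5))
M = G @ G.T + np.eye(5)  # symmetric positive definite, so A and S are invertible

A, B, C, D = M[:3, :3], M[:3, 3:], M[3:, :3], M[3:, 3:]
M_inv = block_inverse(A, B, C, D)
```

The formula is convenient when one sampling point is appended to an existing set, because the inverse of the old kernel matrix plays the role of `A_inv` and can be reused.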
III Construction
To begin with, we use a method similar to that in [34] to construct a vector-valued Banach space with the norm based on a kernel satisfying (A2) and (A3) in Definition II.3.
Let be a given input space whose cardinality is infinite. We shall construct the following two RKBSs of vector-valued functions from to . The first one is
[TABLE]
with the norm
[TABLE]
And the second one is
[TABLE]
with the norm
[TABLE]
III-A The bilinear form and point evaluations
Denote
[TABLE]
with the norm
[TABLE]
and a linear space
[TABLE]
The above two linear spaces both consist of functions from to .
We then define a bilinear form on by
[TABLE]
where and for .
By (A3), we know that the norm in (8) and the above bilinear form in (9) are well-defined on their underlying spaces.
To proceed, we have to show that the point evaluation operators or defined as follows
[TABLE]
are continuous operators.
Proposition III.1.
The point evaluation operators are continuous on in the sense that
[TABLE]
where is the constant in (A2).
Proof.
Let with . Then we have
[TABLE]
This shows that the point evaluation operators are continuous on .
By [34, Proposition 2.4], we know that the norm defined as follows
[TABLE]
is well-defined. Moreover, by a similar reasoning as in Proposition III.1, we can show that the point evaluation operators on are continuous and
[TABLE]
for every .
The norm defined as in (10) has another equivalent but simpler form.
Proposition III.2.
For any , it holds that
[TABLE]
Proof.
By (11), we have . We shall prove the opposite direction. For any , there exist pairwise distinct points such that
[TABLE]
Then, we have for every ,
[TABLE]
It follows that , which completes the proof.
Until now, we have defined two normed vector spaces and , with their point evaluation functionals being continuous. There is also a bilinear form (9) defined on .
III-B Completion of and
With the previous preparations, we are now ready to complete and . Just as in the classical completion process, we simply add elements to and to make them Banach spaces of functions. For convenience, we use the notation to represent or . Let be a Cauchy sequence in . Then by Proposition III.1 and the fact that is a Banach space, for any , the sequence converges to some point in . We denote this limit by , which defines a vector-valued function . It is easy to see that is well-defined. We then let be the set consisting of all such limit vector-valued functions with the norm . Here, denotes either or .
Since the rest of the completion process is the same as in [34], we give only a brief review and state the following conclusions without proof.
By Proposition III.1 and [34, Proposition 2.3 and 3.1], we have
[TABLE]
By Proposition III.2 and [34, Proposition 2.5 and Lemma 3.3], we have
[TABLE]
Moreover, the bilinear form could be extended uniquely to such that the reproducing property in Definition II.2 holds true. That is,
[TABLE]
for every .
We conclude the above discussion as follows.
Theorem III.3.
Let be a kernel function satisfying (A2) and (A3). Then the spaces and , which are defined in (4) and (6) with their norms as in (5) and (7), respectively, satisfy
- (i) they are both RKBSs of vector-valued functions from to with being their reproducing kernel;
- (ii) the bilinear form (9) can be extended to , which satisfies the reproducing property (12) and
[TABLE]
for every .
IV The Representer Theorem
The linear representer theorem is very important in regularized learning schemes in machine learning. It enables us to transform the optimization problem in an infinite-dimensional space to an equivalent one in a finite-dimensional subspace. The representer theorem for the regularized learning schemes on RKBSs and for the minimal norm interpolations are often related [24, 2, 34].
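Once a minimizer is known to be a finite kernel expansion over the sampling points, the learning problem becomes an optimization over a coefficient array. The sketch below illustrates this with a group lasso penalty and a squared loss, solved by proximal gradient descent. The solver, the squared loss, the scalar exponential kernel $e^{-|s-t|}$ combined with an identity task matrix, the real-valued data, and all function names are assumptions chosen for illustration; none of them is prescribed by the paper.

```python
import numpy as np

def exp_kernel(s, t):
    """Scalar exponential kernel exp(-|s - t|) (assumed form)."""
    return np.exp(-np.abs(s[:, None] - t[None, :]))

def group_soft_threshold(C, tau):
    """Row-wise proximal map of tau * sum_j ||C_j||_2."""
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * C

def objective(C, G, Y, lam):
    """Squared loss plus the group lasso penalty on the rows of C."""
    return 0.5 * np.sum((G @ C - Y) ** 2) + lam * np.linalg.norm(C, axis=1).sum()

def fit(G, Y, lam, n_iter=500):
    """Proximal gradient iterations with step 1/L, where L = ||G||_2^2."""
    step = 1.0 / np.linalg.norm(G, ord=2) ** 2
    C = np.zeros_like(Y)
    for _ in range(n_iter):
        grad = G.T @ (G @ C - Y)
        C = group_soft_threshold(C - step * grad, step * lam)
    return C

x = np.linspace(0.0, 1.0, 8)                       # sampling points
Y = np.column_stack([np.sin(2 * np.pi * x),        # task 1 observations
                     np.cos(2 * np.pi * x)])       # task 2 observations
G = exp_kernel(x, x)                               # row i of G @ C is f(x_i)

lam_max = np.linalg.norm(G.T @ Y, axis=1).max()    # above this, C = 0 is optimal
C_small = fit(G, Y, 1e-3)
C_zero = fit(G, Y, 1.01 * lam_max)
```

Row $j$ of `C` is the coefficient vector attached to the sampling point $x_j$, so a zero row removes that point from the expansion for all tasks at once.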
Here in this section, we use the assumptions (A1), (A2) and (A4) in Definition II.3 to deduce a corresponding representer theorem for the constructed vector-valued RKBSs and .
Recall that a linear operator between normed vector spaces is said to be completely continuous [10] on , if for any sequence weakly convergent to , converges to strongly. Note that every compact linear operator is completely continuous. For example, the projection from an infinite-dimensional Banach space to a finite-dimensional subspace is completely continuous. We borrow the terminology from this definition for general vector-valued functionals on Banach spaces.
Definition IV.1 (Acceptable Regularized Learning Schemes).
Let be the set of pairwise distinct sampling points. For , denote . Let satisfy for any . Let and be a nondecreasing function. A regularized learning scheme
[TABLE]
is said to be acceptable in if is completely continuous on , is continuous and .
Note that if the space is a finite-dimensional vector space or the classical , then strong continuity is equivalent to continuity.
Definition IV.2.
The space is said to satisfy the linear representer theorem for the acceptable regularized learning if every acceptable regularized learning scheme (14) has a minimizer of the form
[TABLE]
Denote
[TABLE]
One should be aware that although the space defined here is the “span” of with coefficients in , it may not be a finite-dimensional subspace of . That is why we impose complete continuity on .
A minimal norm interpolant in with respect to is a function satisfying
[TABLE]
where . Unless stated otherwise, we assume that always exists.
Definition IV.3.
The space is said to satisfy the linear representer theorem for minimal norm interpolation if for an arbitrary choice of training data , there is a minimal norm interpolant , obtained as in (16), that lies in
Similar ideas and techniques as those in [34, Lemma 4.4 and 4.5] lead to the following theorem.
Theorem IV.4.
The space satisfies the linear representer theorem for acceptable regularized learning if and only if satisfies the linear representer theorem for minimal norm interpolation.
Hence, to consider connections between the assumption (A4) and the acceptable regularized learning scheme is equivalent to considering the connections between (A4) and the minimal norm interpolation problem. The advantage for finding such equivalence is that the minimal norm interpolation problem is much easier to deal with. The following lemma confirms this fact.
Lemma IV.5.
Let consist of pairwise distinct elements in , , and set . Then
[TABLE]
for every if and only if satisfies (A4).
Theorem IV.6.
Every minimal norm interpolant of (16) in satisfies the linear representer theorem if and only if (A4) holds true.
Proof.
We begin with the necessity. Note that the minimal norm interpolant of (16) satisfies the linear representer theorem if and only if
[TABLE]
Therefore, if the above equation holds true, then by the fact that , we obtain (17) and by Lemma IV.5, the assumption (A4) holds true for every .
Turning to the sufficiency, we notice
[TABLE]
To finish the proof we have to show that the reverse of the aforementioned inequality also holds true.
To this end, for any , we can express as for some and pairwise distinct . This is true since we can always add extra samplings from by setting the corresponding coefficients to zero, and relabelling if necessary. Let and
[TABLE]
Note that and and . Therefore we have
[TABLE]
Also, by Lemma IV.5 and the fact that ,
[TABLE]
Thus, we have
[TABLE]
Repeat this process until (18) holds true for .
For a general , a limiting process does the work. In fact, let be the sequence that converges to in . Take as follows:
[TABLE]
Since as and the point evaluation functionals are continuous on , for as . As a consequence,
[TABLE]
Since we already knew that for all , the inequality
[TABLE]
follows by taking the limit. The proof is complete.
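The mechanism behind the sufficiency argument can be observed numerically in the scalar case. The sketch below assumes the Brownian bridge kernel in the closed form $\min(s,t) - st$ on $(0,1)$ and compares the $\ell^1$ norm of the coefficients of the kernel interpolant on the sampling points with that of competitors that use one extra node while still interpolating the data. The sampling points, the data, the extra node, and all names are hypothetical choices made for illustration.

```python
import numpy as np

def bridge_kernel(s, t):
    """Brownian bridge kernel min(s, t) - s t on (0, 1) (assumed form)."""
    return np.minimum(s[:, None], t[None, :]) - s[:, None] * t[None, :]

x = np.array([0.2, 0.5, 0.7])     # sampling points
y = np.array([1.0, -2.0, 0.5])    # observations
t = np.array([0.35])              # one extra node

G = bridge_kernel(x, x)
c = np.linalg.solve(G, y)                            # kernel interpolant on x
ell = np.linalg.solve(G, bridge_kernel(x, t))[:, 0]  # K[x]^{-1} K_x(t)

def competitor_norm(b):
    """l1 coefficient norm of an interpolant that puts weight b on the extra node."""
    a = c - b * ell               # remaining weights are fixed by interpolation on x
    return np.abs(a).sum() + abs(b)

base = np.abs(c).sum()
gaps = [competitor_norm(b) - base for b in np.linspace(-3.0, 3.0, 61)]
```

Every gap is nonnegative because $\|a\|_1 + |b| \ge \|c\|_1 - |b|\,\|K[x]^{-1}K_x(t)\|_1 + |b|$, and the last norm does not exceed 1 for this kernel, so moving weight to an extra node never lowers the norm.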
Combining Theorem III.3 with Theorem IV.4 and IV.6, we have the following corollary for any .
Corollary IV.1.
Let satisfy (A1)-(A3) as in Definition II.3. Then it induces an RKBS and the following three statements are equivalent:
- (a) The kernel satisfies the assumption (A4).
- (b) Every acceptable regularized learning scheme in of the form (14) has a minimizer with the form (15).
- (c) Every minimal norm interpolant (16) in satisfies the linear representer theorem.
We comment that if satisfies (A4), then also satisfies the linear representer theorem for the acceptable regularized learning. For more details, we recommend [34, Theorem 4.12 and Proposition 4.13].
We finish this section by stating the following conclusion.
Theorem IV.7.
If is an admissible kernel on , then and as defined in Section III, with their norms defined as in (5) and (7) respectively, are both vector-valued RKBSs on . Moreover, the bilinear form satisfies (12) and the Cauchy inequality (13).
Furthermore, every acceptable regularized learning scheme as in Definition IV.1 has a minimizer of the form
[TABLE]
for some .
The converse is also true. That is, for the constructed spaces and to enjoy the above properties, must be an admissible kernel on .
V Admissible Kernels
We have seen that admissible kernels are fundamental to our construction. We give examples of admissible kernels in this section.
Recall the term in (A4), which usually refers to the Lebesgue constant [18] of the kernel that measures the stability of the kernel-based interpolation.
Define
[TABLE]
to be the Lebesgue constant of a kernel , where is a finite subset of and is some specified norm. For example, corresponds to the classical Hilbert norm and to the -norm. We seek kernels such that
[TABLE]
It is shown in [34] that both the Brownian bridge kernel
[TABLE]
and the exponential kernel
[TABLE]
are admissible scalar-valued kernels. Here we present a new family of admissible scalar-valued kernels. We can then utilize these scalar-valued kernels to construct admissible operator-valued kernels for our purpose in [1, 7]:
[TABLE]
where is a single-task kernel and denotes a positive-definite matrix.
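For scalar kernels, the Lebesgue constant can be estimated on a grid. The sketch below assumes the closed forms $\min(s,t) - st$ on $(0,1)$ for the Brownian bridge kernel and $e^{-|s-t|}$ for the exponential kernel, and takes the Lebesgue constant to be the largest $\ell^1$ norm of $K[x]^{-1}K_x(t)$ over the test points $t$. The node set, the grid, and the function names are hypothetical, and a grid maximum is only a lower estimate of the true supremum.

```python
import numpy as np

def bridge_kernel(s, t):
    """Brownian bridge kernel min(s, t) - s t on (0, 1) (assumed form)."""
    return np.minimum(s[:, None], t[None, :]) - s[:, None] * t[None, :]

def exp_kernel(s, t):
    """Exponential kernel exp(-|s - t|) (assumed form)."""
    return np.exp(-np.abs(s[:, None] - t[None, :]))

def lebesgue_constant(kernel, nodes, test_points):
    """Largest l1 norm of K[x]^{-1} K_x(t) over the supplied test points t."""
    G = kernel(nodes, nodes)
    L = np.linalg.solve(G, kernel(nodes, test_points))  # one column per test point
    return float(np.abs(L).sum(axis=0).max())

nodes = np.array([0.1, 0.25, 0.4, 0.6, 0.75, 0.9])
grid = np.linspace(0.01, 0.99, 197)

lc_bridge = lebesgue_constant(bridge_kernel, nodes, grid)
lc_exp = lebesgue_constant(exp_kernel, nodes, grid)
```

For both kernels the estimate stays at or below 1 on this node set, in line with the admissibility result quoted from [34].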
V-A A new family of admissible scalar-valued kernels
The new family is
[TABLE]
It contains the Brownian bridge kernel when . When , it is the covariance of the Brownian motion .
Proposition V.1.
The family of functions in (20) are admissible kernels.
Proof.
Let and . An easy computation shows that the determinant of the kernel matrix is . Then is strictly positive definite for any and therefore satisfies the assumption (A1′). The function is clearly uniformly bounded by for . Also, by the same reasoning as in [34, Proposition 5.1], we can verify that satisfies (A3′) and (A4′) for .
V-B Admissible kernel for multi-task learning
We will show that the multi-task kernel defined in (19) is admissible whenever is. Let be a scalar-valued kernel and let be an invertible operator as in (A2). Then we have the following lemma.
Lemma V.2.
Let be a multi-task kernel given as in (19) and be a set of pairwise distinct points. If the Lebesgue constant is bounded by , then so is .
Proof.
We compute
[TABLE]
Then we have
[TABLE]
which completes the proof.
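The computation behind this lemma can be mirrored with Kronecker products when the output space is finite-dimensional. The sketch below assumes a multi-task kernel of the form $k(x,y)A$ with a real invertible matrix $A$ and the exponential kernel $e^{-|s-t|}$ as the scalar kernel, and it measures the block column by the sum of the spectral norms of its blocks. The matrix `A`, the nodes, the extra point, and this choice of block norm are assumptions made for illustration.

```python
import numpy as np

def exp_kernel(s, t):
    """Exponential kernel exp(-|s - t|) (assumed scalar kernel)."""
    return np.exp(-np.abs(s[:, None] - t[None, :]))

nodes = np.array([0.1, 0.4, 0.8])
t = np.array([0.55])
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])              # hypothetical invertible task matrix

k_x = exp_kernel(nodes, nodes)          # scalar kernel matrix k[x]
k_t = exp_kernel(nodes, t)              # scalar column k_x(t)

big_matrix = np.kron(k_x, A)            # block (i, j) equals k(x_i, x_j) A
big_column = np.kron(k_t, A)            # block i equals k(x_i, t) A

lhs = np.linalg.solve(big_matrix, big_column)
scalar_weights = np.linalg.solve(k_x, k_t)
rhs = np.kron(scalar_weights, np.eye(2))  # the factor A cancels

blocks = lhs.reshape(3, 2, 2)           # one 2 x 2 block per sampling point
block_norm_sum = sum(np.linalg.norm(b, ord=2) for b in blocks)
scalar_l1 = float(np.abs(scalar_weights).sum())
```

Because each block is a scalar multiple of the identity, the block norm of the multi-task quantity coincides with the $\ell^1$ norm of the scalar weights, so a bound on the scalar Lebesgue constant carries over unchanged.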
The following corollary follows directly from Lemma V.2.
Corollary V.1.
Let be defined as in Lemma V.2. If the Lebesgue constant is uniformly bounded by for all pairwise distinct points , then .
The connection between the assumptions (A3′) and (A3) is stated below.
Lemma V.3.
Let be a kernel function satisfying (A3′). Then for any invertible operator , satisfies (A3).
Proof.
Let be pairwise distinct points in and . Suppose that for every . Then the sequence converges. Since we have
[TABLE]
As a consequence, coordinatewise for every . Then we know that for every . That is, .
It follows from the above lemma that (A3) is automatically satisfied by the kernel with the form . We are now ready to present the following proposition.
Proposition V.4.
Let be an admissible scalar-valued kernel, and an invertible operator in . Then is also an admissible kernel.
Proof.
Note that the assumption (A1) follows from the fact that is invertible and strictly positive, and (A2) follows by . By Lemma V.3, (A3) holds true provided that satisfies (A3′). Finally, by Lemma V.2 and (A4′), (A4) holds.
As a conclusion, we know that, for any invertible operator ,
[TABLE]
are all admissible multi-task kernels.
V-C More admissible kernels
Wendland’s kernel function in [42] has several well-behaved properties and is widely used in interpolation and kernel-based learning problems. We consider a restricted form of Wendland’s function
[TABLE]
We are able to show that positive linear combinations of and have Lebesgue constants bounded above by .
We have the following result for .
Proposition V.5.
The Wendland kernel in (21) satisfies (A4′).
Also, the positive linear combinations of and still have their Lebesgue constants being bounded by 1. Denote
[TABLE]
Proposition V.6.
The following class of kernel functions
[TABLE]
satisfies (A4′).
The proof of the above proposition is technical and is available in the full version of this paper on arXiv.
VI Conclusion
We established a theory for multi-task learning in vector-valued RKBSs with $\ell_{p,1}$-norms. These norms include the classical $\ell^1$-norm and the group lasso norm. We explicitly construct the vector-valued RKBSs by using admissible kernel functions. We prove that the representer theorem for acceptable learning schemes, the representer theorem for minimal norm interpolation, and the admissibility assumption (A4) are all equivalent. As for admissible kernels, we present a new family of admissible scalar-valued kernels, from which we construct admissible kernels for multi-task learning.
References
- [1] M. A. Álvarez, L. Rosasco, and N. D. Lawrence, Kernels for vector-valued functions: A review, Found. Trends Mach. Learn. 4 (2012), 195-266.
- [2] A. Argyriou, C. A. Micchelli, and M. Pontil, When is there a representer theorem? Vector versus matrix regularizers, J. Mach. Learn. Res. 10 (2009), 2507-2529.
- [3] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950), 337-404.
- [4] A. Berlinet and C. Thomas-Agnan, Reproducing Kernel Hilbert Spaces in Probability and Statistics, Kluwer Academic Publishers, Boston, MA, 2004.
- [5] J. Burbea and P. Masani, Banach and Hilbert Spaces of Vector-valued Functions, Research Notes in Mathematics 90, Pitman Publishers, Boston, MA, 1984.
- [6] E. J. Candès, J. Romberg, and T. Tao, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, IEEE Trans. Inform. Theory 52 (2006), 489-509.
- [7] A. Caponnetto, C. A. Micchelli, M. Pontil, and Y. Ying, Universal multi-task kernels, J. Mach. Learn. Res. 9 (2008), 1615-1646.
- [8] C. Carmeli, E. De Vito, A. Toigo, and V. Umanità, Vector-valued reproducing kernel Hilbert spaces and universality, Anal. Appl. (Singap.) 8 (2010), 19-61.
