On the Sample Complexity of HGR Maximal Correlation Functions for Large Datasets

Shao-Lun Huang; Xiangxiang Xu

arXiv:1907.00393·cs.IT·September 15, 2021


TL;DR

This paper analyzes the sample complexity of estimating HGR maximal correlation functions using the ACE algorithm on large datasets, providing theoretical bounds, optimal sampling strategies, and supporting simulations.

Contribution

It develops a mathematical framework for understanding learning errors and error exponents in estimating HGR functions, and proposes an optimal sampling strategy for semi-supervised learning.

Findings

01. Derived analytical expressions for error exponents.
02. Established upper bounds for sample complexity.
03. Proposed an optimal sampling strategy to maximize error exponents.

Abstract

The Hirschfeld–Gebelein–Rényi (HGR) maximal correlation and the corresponding functions have been shown useful in many machine learning scenarios. In this paper, we study the sample complexity of estimating the HGR maximal correlation functions by the alternating conditional expectation (ACE) algorithm using training samples from large datasets. Specifically, we develop a mathematical framework to characterize the learning errors between the maximal correlation functions computed from the true distribution, and the functions estimated from the ACE algorithm. For both supervised and semi-supervised learning scenarios, we establish the analytical expressions for the error exponents of the learning errors. Furthermore, we demonstrate that for large datasets, the upper bounds for the sample complexity of learning the HGR maximal correlation functions by the ACE algorithm can be…


Equations

$$\rho(X;Y)\triangleq\max_{f\colon{\mathcal{X}}\mapsto\mathbb{R},\ g\colon{\mathcal{Y}}\mapsto\mathbb{R}}\mathbb{E}\left[f(X)g(Y)\right],$$

$$\rho_{k}(X;Y)\triangleq\max_{\substack{f\colon{\mathcal{X}}\mapsto\mathbb{R}^{k},\ g\colon{\mathcal{Y}}\mapsto\mathbb{R}^{k}\\ \mathbb{E}[f(X)]=\mathbb{E}[g(Y)]=\mathbf{0}\\ \mathbb{E}[f(X)f^{\mathrm{T}}(X)]=\mathbb{E}[g(Y)g^{\mathrm{T}}(Y)]=\mathbf{I}}}\mathbb{E}\left[f^{\mathrm{T}}(X)g(Y)\right],$$

$$B(y,x)=\frac{P_{XY}(x,y)}{\sqrt{P_{X}(x)P_{Y}(y)}},$$

$$\left[f^{*}(1)\sqrt{P_{X}(1)},\dots,f^{*}(|{\mathcal{X}}|)\sqrt{P_{X}(|{\mathcal{X}}|)}\right]^{\mathrm{T}}\quad\text{and}\quad\left[g^{*}(1)\sqrt{P_{Y}(1)},\dots,g^{*}(|{\mathcal{Y}}|)\sqrt{P_{Y}(|{\mathcal{Y}}|)}\right]^{\mathrm{T}}$$

$$\tilde{B}(y,x)=B(y,x)-\sqrt{P_{X}(x)}\sqrt{P_{Y}(y)}=\frac{P_{XY}(x,y)-P_{X}(x)P_{Y}(y)}{\sqrt{P_{X}(x)P_{Y}(y)}},$$

$$\mathbf{\Lambda}_{\hat{f}}=\frac{1}{n}\sum_{i=1}^{n}\hat{f}(x_{i})\hat{f}^{\mathrm{T}}(x_{i}),\quad\mathbf{\Lambda}_{\hat{g}}=\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i})\hat{g}^{\mathrm{T}}(y_{i}).$$

$$\hat{\phi}_{i}(x)=\sqrt{\hat{P}_{X}(x)}\,\hat{f}_{i}(x),\quad\hat{\psi}_{i}(y)=\sqrt{\hat{P}_{Y}(y)}\,\hat{g}_{i}(y),$$

$$\hat{f}(x)=\left[\hat{f}_{1}(x),\dots,\hat{f}_{k}(x)\right]^{\mathrm{T}},\quad\hat{g}(y)=\left[\hat{g}_{1}(y),\dots,\hat{g}_{k}(y)\right]^{\mathrm{T}},$$

$$\hat{B}(y,x)=\frac{\hat{P}_{XY}(x,y)}{\sqrt{\hat{P}_{X}(x)\hat{P}_{Y}(y)}}-\sqrt{\hat{P}_{X}(x)}\sqrt{\hat{P}_{Y}(y)},$$

i) $\hat{{\mathbb{\Phi}}}_{k}$

ii) $\hat{{\mathbb{\Psi}}}_{k}$

$$\hat{{\mathbb{\Phi}}}_{k}=\left[\hat{\bm{\phi}}_{1},\dots,\hat{\bm{\phi}}_{k}\right],\quad\hat{{\mathbb{\Psi}}}_{k}=\left[\hat{\bm{\psi}}_{1},\dots,\hat{\bm{\psi}}_{k}\right].$$

$$\min_{\hat{{\mathbb{\Psi}}}_{k},\hat{{\mathbb{\Phi}}}_{k}}\left\|\hat{\mathbf{B}}-\hat{{\mathbb{\Psi}}}_{k}\hat{{\mathbb{\Phi}}}_{k}^{\mathrm{T}}\right\|_{\mathrm{F}}^{2}.$$

$${\mathbb{\Phi}}_{k}\triangleq\left[\bm{\phi}_{1},\dots,\bm{\phi}_{k}\right],$$

$$\left\|\bm{\phi}_{1}-\hat{\bm{\phi}}_{1}\right\|^{2}\leq\frac{2}{\sigma_{1}^{2}-\sigma_{2}^{2}}\left(\left\|\tilde{\mathbf{B}}\bm{\phi}_{1}\right\|^{2}-\left\|\tilde{\mathbf{B}}\hat{\bm{\phi}}_{1}\right\|^{2}\right),$$

$$\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}=\left\|\tilde{\mathbf{B}}\right\|_{\mathrm{F}}^{2}-\min_{\hat{{\mathbb{\Psi}}}_{k}}\left\|\tilde{\mathbf{B}}-\hat{{\mathbb{\Psi}}}_{k}\hat{{\mathbb{\Phi}}}_{k}^{\mathrm{T}}\right\|_{\mathrm{F}}^{2}.$$

$$\mathrm{E}_{k}\triangleq-\lim_{\epsilon\to 0^{+}}\frac{1}{\epsilon}\lim_{n\to\infty}\frac{1}{n}\log\mathbb{P}_{n}\left\{\left\|\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\right\|_{\mathrm{F}}^{2}-\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}>\epsilon\right\},$$

$$\mathbf{V}_{k}\triangleq\left[\bm{v}_{1},\dots,\bm{v}_{k}\right]\in\mathbb{R}^{d\times k}$$

$$\operatorname*{tr}\left\{\mathbf{V}_{k}^{\mathrm{T}}\mathbf{A}\mathbf{V}_{k}\right\}=\sum_{i=1}^{k}\lambda_{i},$$

$$\mathbf{A}(\tau)=\mathbf{A}+\tau\mathbf{A}^{\prime}+o(\tau),$$

$$\operatorname*{tr}\left\{\mathbf{V}_{k}^{\mathrm{T}}(\tau)\mathbf{A}\mathbf{V}_{k}(\tau)\right\}=\operatorname*{tr}\left\{\mathbf{V}_{k}^{\mathrm{T}}\mathbf{A}\mathbf{V}_{k}\right\}-\tau^{2}\sum_{i=1}^{k}\sum_{j=k+1}^{d}\frac{\left(\bm{v}_{i}^{\mathrm{T}}\mathbf{A}^{\prime}\bm{v}_{j}\right)^{2}}{\lambda_{i}-\lambda_{j}}+o(\tau^{2}),$$

$$\operatorname*{tr}\left\{\mathbf{V}_{k}^{\mathrm{T}}(\tau)\mathbf{A}\mathbf{V}_{k}(\tau)\right\}=\operatorname*{tr}\left\{\mathbf{V}_{k}^{\mathrm{T}}\mathbf{A}\mathbf{V}_{k}\right\}-\tau^{2}\sum_{i=1}^{l-1}\sum_{j=l}^{d}\frac{\left(\bm{v}_{i}^{\mathrm{T}}\mathbf{A}^{\prime}\bm{v}_{j}\right)^{2}}{\lambda_{i}-\lambda_{j}}-\tau^{2}\sum_{i=l}^{k}\sum_{j\in\mathcal{I}^{\mathsf{c}}_{k}}\frac{\left(\hat{\bm{v}}_{i}^{\mathrm{T}}\mathbf{A}^{\prime}\bm{v}_{j}\right)^{2}}{\lambda_{k}-\lambda_{j}}+o(\tau^{2}),$$

$$\hat{\bm{v}}_{i}\triangleq\mathbf{V}_{\mathcal{I}_{k}}\bm{u}_{i-l+1},\quad l\leq i\leq k,$$

$$\mathbf{G}_{k}\triangleq\mathbf{L}^{\mathrm{T}}\left[\sum_{i=1}^{k}\sum_{j=k+1}^{d}\frac{\bm{\theta}_{ij}\bm{\theta}_{ij}^{\mathrm{T}}}{\sigma_{i}^{2}-\sigma_{j}^{2}}\right]\mathbf{L},$$

$$\sqrt{\frac{P_{XY}(x^{\prime},y^{\prime})}{P_{X}(x)P_{Y}(y)}}\left[\delta_{xx^{\prime}}\delta_{yy^{\prime}}-\frac{1}{2}\left(\frac{\delta_{xx^{\prime}}}{P_{X}(x)}+\frac{\delta_{yy^{\prime}}}{P_{Y}(y)}\right)\cdot\left[P_{XY}(x,y)+P_{X}(x)P_{Y}(y)\right]\right],$$

$$\bm{\theta}_{ij}\triangleq\bm{\phi}_{j}\otimes\left(\tilde{\mathbf{B}}\bm{\phi}_{i}\right)+\bm{\phi}_{i}\otimes\left(\tilde{\mathbf{B}}\bm{\phi}_{j}\right),\quad 1\leq i\leq j\leq d,$$

$$\mathrm{E}_{k}=-\lim_{\epsilon\to 0^{+}}\frac{1}{\epsilon}\lim_{n\to\infty}\frac{1}{n}\log\mathbb{P}_{n}\left\{\left\|\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\right\|_{\mathrm{F}}^{2}-\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}>\epsilon\right\}=\frac{1}{2\alpha_{k}}.$$

$$\mathbb{P}_{n}\left\{\left\|\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\right\|_{\mathrm{F}}^{2}-\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}>\epsilon\right\}<\delta$$

$$N^{(t)}(\epsilon,\delta)\triangleq\frac{t|{\mathcal{X}}||{\mathcal{Y}}|}{\epsilon}\log\frac{6t|{\mathcal{X}}||{\mathcal{Y}}|}{\epsilon}+\frac{t}{\epsilon}\log\frac{1}{\delta},$$

$$\mathrm{E}_{k}=\frac{2}{\sigma_{1}^{2}}.$$

$$S_{1}(\epsilon)\triangleq\left\{\hat{P}_{XY}\colon\left\|\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\right\|_{\mathrm{F}}^{2}-\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}>\epsilon\right\},$$


Full text

On the Sample Complexity of HGR Maximal Correlation Functions for Large Datasets

Shao-Lun Huang and Xiangxiang Xu

This paper was presented in part at the Inform. Theory Workshop (ITW-2019), Visby, Sweden, Aug. 2019 and at Allerton Conf. Commun., Contr., Computing (Allerton-2019), Monticello, IL, Sep. 2019. S.-L. Huang is with the Data Science and Information Technology Research Center, Tsinghua–Berkeley Shenzhen Institute, Shenzhen, China (e-mail: [email protected]). X. Xu is with the Department of Electronic Engineering, Tsinghua University, Beijing, China (e-mail: [email protected]).

Abstract

The Hirschfeld–Gebelein–Rényi (HGR) maximal correlation and the corresponding functions have been shown useful in many machine learning scenarios. In this paper, we study the sample complexity of estimating the HGR maximal correlation functions by the alternating conditional expectation (ACE) algorithm using training samples from large datasets. Specifically, we develop a mathematical framework to characterize the learning errors between the maximal correlation functions computed from the true distribution, and the functions estimated from the ACE algorithm. For both supervised and semi-supervised learning scenarios, we establish the analytical expressions for the error exponents of the learning errors. Furthermore, we demonstrate that for large datasets, the upper bounds for the sample complexity of learning the HGR maximal correlation functions by the ACE algorithm can be expressed using the established error exponents. Moreover, with our theoretical results, we investigate the sampling strategy for different types of samples in semi-supervised learning with a total sampling budget constraint, and an optimal sampling strategy is developed to maximize the error exponent of the learning error. Finally, the numerical simulations are presented to support our theoretical results.

Index Terms:

error exponent, sample complexity, HGR maximal correlation, ACE algorithm, supervised learning, semi-supervised learning, singular value decomposition, generalization error

I Introduction

Learning informative and generalizable representations of data is a crucial issue in machine learning [1]. To measure correlation and select informative features, one can use the Hirschfeld–Gebelein–Rényi (HGR) maximal correlation [2, 3, 4], a normalized measure of the dependence between two random variables that has been widely applied as an information metric to study inference and learning problems [5, 6, 7]. Specifically, given a pair of jointly distributed discrete random variables $X,Y$ over finite alphabets ${\mathcal{X}},{\mathcal{Y}}$, their HGR maximal correlation $\rho(X;Y)$ is defined as

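$$\rho(X;Y)\triangleq\max_{f\colon{\mathcal{X}}\mapsto\mathbb{R},\ g\colon{\mathcal{Y}}\mapsto\mathbb{R}}\mathbb{E}\left[f(X)g(Y)\right],\tag{1}$$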

where the maximum is taken over all functions $f,g$ with zero mean and unit variance. Therefore, the HGR maximal correlation characterizes the correlation between the most correlated function mappings of $X$ and $Y$ , and the optimal functions $f,g$ that achieve the maximal correlation essentially extract the most correlated aspects between $X$ and $Y$ . Recently, the HGR maximal correlation has been further generalized to consider the correlation in the $k$ -dimensional functional spaces by defining [8]

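$$\rho_{k}(X;Y)\triangleq\max_{\substack{f\colon{\mathcal{X}}\mapsto\mathbb{R}^{k},\ g\colon{\mathcal{Y}}\mapsto\mathbb{R}^{k}\\ \mathbb{E}[f(X)]=\mathbb{E}[g(Y)]=\mathbf{0}\\ \mathbb{E}[f(X)f^{\mathrm{T}}(X)]=\mathbb{E}[g(Y)g^{\mathrm{T}}(Y)]=\mathbf{I}}}\mathbb{E}\left[f^{\mathrm{T}}(X)g(Y)\right],\tag{2}$$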

of which the special case with $k=1$ corresponds to the original problem (1). In particular, the optimal functions $f^{*},g^{*}$ maximizing (2), referred to as the maximal correlation functions, have been shown to play important roles in statistics [6], information theory [8, 9], machine learning [10, 11, 12], and especially in interpreting deep neural networks [13]. For example, in machine learning scenarios, the variable $Y$ can be viewed as the label, and $X$ is the data variable that is used to infer or predict attributes of $Y$. Then, $f^{*}$ can be interpreted as the optimal feature to predict $Y$, with $g^{*}$ being the corresponding weights [12]. Therefore, efficiently and effectively computing maximal correlation functions from data is important in information theory and machine learning.

In this paper, we study the sample complexity of estimating the maximal correlation functions from a sequence of $n$ training samples $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$ , i.i.d. generated from the (unknown) true distribution $P_{XY}$ , by the widely adopted alternating conditional expectation (ACE) algorithm [14]. Mathematically, the ACE algorithm computes the maximal correlation functions over the empirical joint distribution $\hat{P}_{XY}$ of the training samples. Therefore, there exists a learning error between the true maximal correlation functions and the computed functions due to the i.i.d. sampling process. To quantify this learning error, for functions computed from the ACE algorithm, we apply the H-score introduced by [13] as the performance metric. It has been shown that the H-score of a function indicates the performance of employing that function as the input feature to the softmax regression [13], and hence is a meaningful information metric in machine learning applications. Then, we study the large deviation property of the H-score for the functions computed from the ACE algorithm, in which we characterize the error exponent of the learning error in the asymptotic regime, i.e., when $n$ tends to infinity. In particular, we establish the analytical solutions for this error exponent expressed by the true distribution $P_{XY}$ . Furthermore, for large datasets, we demonstrate an upper bound for the sample complexity of learning the maximal correlation functions, which can be expressed using the established error exponent. Our results also provide insights in designing the dimensionality $k$ for the selected features to effectively extract correlation structures among data variables in machine learning problems.

In addition, we investigate the sample complexity of the maximal correlation functions in the semi-supervised learning [15] scenario, in which not only the labeled samples, but also a sequence of $m$ unlabeled training samples $x_{n+1},\ldots,x_{n+m}$ is observed. In this case, the empirical joint distribution can in general be learned more accurately, since the marginal distribution of $X$ is estimated better thanks to the unlabeled samples. Thus, the sample complexity is expected to improve. To quantify this performance gain, we first propose a generalized ACE algorithm to deal with the unlabeled training samples, and then study the sample complexity of estimating the maximal correlation functions from the generalized ACE algorithm. As in the supervised case, we develop closed-form expressions for the error exponent of the learning error, and demonstrate the performance gain from the unlabeled samples. In addition, the theoretical results are applied to study the optimal sampling strategy when the labeled and unlabeled samples have different acquisition costs. In particular, we formulate an optimization problem to maximize the error exponent of estimating the maximal correlation functions from different types of samples, subject to a total budget constraint on acquiring these samples. The solution of this optimization problem then illustrates the optimal design of selecting different types of samples in semi-supervised learning problems. Finally, our theoretical results are supported by numerical simulations.

The rest of this paper is organized as follows. In Section II, we formulate the sample complexity problem of the maximal correlation functions and define the corresponding error exponent of the test error, and the mathematical framework for computing this error exponent is presented in Section III. In Section IV, we establish the analytical expression for the error exponent in the supervised learning scenario, in which the number of required samples for computing the maximal correlation functions on large datasets is provided. Then, similar analyses of the error exponent and the number of samples required for the semi-supervised learning are presented in Section V, in which we develop the optimal sampling strategy for semi-supervised learning with a sampling budget constraint. Finally, the numerical simulations are presented in Section VI to support our theoretical results.

II Problem Formulation

We commence by briefly introducing the singular value decomposition (SVD) structure of the HGR maximal correlation problem and the ACE algorithm for computing the maximal correlation functions. Given the joint distribution¹ $P_{XY}$, the HGR maximal correlation (1) is known to be the second largest singular value of the matrix $\mathbf{B}\in\mathbb{R}^{|{\mathcal{Y}}|\times|{\mathcal{X}}|}$, also referred to as the divergence transition matrix (DTM) [16], whose entries are [17]

$$B(y,x)=\frac{P_{XY}(x,y)}{\sqrt{P_{X}(x)P_{Y}(y)}},$$

and the maximal correlation functions $f^{*},g^{*}$ are chosen such that the vectors²

$$\left[f^{*}(1)\sqrt{P_{X}(1)},\dots,f^{*}(|{\mathcal{X}}|)\sqrt{P_{X}(|{\mathcal{X}}|)}\right]^{\mathrm{T}}\quad\text{and}\quad\left[g^{*}(1)\sqrt{P_{Y}(1)},\dots,g^{*}(|{\mathcal{Y}}|)\sqrt{P_{Y}(|{\mathcal{Y}}|)}\right]^{\mathrm{T}}$$

are the right and left singular vectors of $\mathbf{B}$ associated with the second largest singular value, respectively. It can be shown that the largest singular value of $\mathbf{B}$ is $1$, with the corresponding right and left singular vectors being $\left[\sqrt{P_{X}(1)},\dots,\sqrt{P_{X}(|{\mathcal{X}}|)}\right]^{\mathrm{T}}$ and $\left[\sqrt{P_{Y}(1)},\dots,\sqrt{P_{Y}(|{\mathcal{Y}}|)}\right]^{\mathrm{T}}$, and thus it is more convenient to subtract the top singular mode and introduce the matrix $\tilde{\mathbf{B}}\in\mathbb{R}^{|{\mathcal{Y}}|\times|{\mathcal{X}}|}$ with entries

$$\tilde{B}(y,x)=B(y,x)-\sqrt{P_{X}(x)}\sqrt{P_{Y}(y)}=\frac{P_{XY}(x,y)-P_{X}(x)P_{Y}(y)}{\sqrt{P_{X}(x)P_{Y}(y)}},\tag{3}$$

referred to as the canonical dependence matrix (CDM) [18]. Then, the HGR maximal correlation and the 1-dimensional maximal correlation functions can be represented by the largest singular value and the corresponding singular vectors of $\tilde{\mathbf{B}}$. It can be shown that the generalized HGR maximal correlation (2) retains this SVD structure [18]. Specifically, the maximal correlation $\rho_{k}(X;Y)$ is the sum of the largest $k$ singular values (i.e., the Ky Fan $k$-norm) of $\tilde{\mathbf{B}}$, and the maximal correlation functions $f^{*},g^{*}$ correspond to its top $k$ right and left singular vectors, respectively.

¹ We assume that the true marginal distributions satisfy $P_{X}(x)>0$ and $P_{Y}(y)>0$ for all $x,y$, since otherwise we can remove the symbols with probability 0 from the alphabets.

² In our development, we may simply take the alphabets ${\mathcal{X}}=\{1,2,\ldots,|{\mathcal{X}}|\}$ and ${\mathcal{Y}}=\{1,2,\ldots,|{\mathcal{Y}}|\}$, which corresponds to some given alphabet orders of random variables $X$ and $Y$.
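The SVD structure described above can be checked numerically on a small example. The following is a minimal sketch, assuming a toy joint distribution stored as an array `P[x, y]` and the square-root normalization of $\mathbf{B}$ under which the top singular vectors are $\sqrt{P_X}$ and $\sqrt{P_Y}$; the helper name `dtm_and_cdm` and the random toy distribution are illustrative choices, not part of the paper.

```python
import numpy as np

def dtm_and_cdm(P):
    """Return (B, B_tilde) for a joint pmf P with P[x, y] = P_XY(x, y)."""
    Px = P.sum(axis=1)                      # marginal of X, shape (|X|,)
    Py = P.sum(axis=0)                      # marginal of Y, shape (|Y|,)
    # B has shape (|Y|, |X|): B[y, x] = P_XY(x, y) / sqrt(P_X(x) P_Y(y))
    B = P.T / np.sqrt(np.outer(Py, Px))
    # Subtract the top singular mode sqrt(P_Y) sqrt(P_X)^T to obtain the CDM
    B_tilde = B - np.outer(np.sqrt(Py), np.sqrt(Px))
    return B, B_tilde

rng = np.random.default_rng(0)
P = rng.random((4, 3))
P /= P.sum()                                # toy joint pmf (assumption), strictly positive

B, B_tilde = dtm_and_cdm(P)
sv_B = np.linalg.svd(B, compute_uv=False)   # singular values of the DTM
U, sv_Bt, Vt = np.linalg.svd(B_tilde)       # SVD of the CDM

rho = sv_Bt[0]                              # HGR maximal correlation (k = 1)
Px, Py = P.sum(axis=1), P.sum(axis=0)
f_star = Vt[0] / np.sqrt(Px)                # f*(x) = phi_1(x) / sqrt(P_X(x))
g_star = U[:, 0] / np.sqrt(Py)              # g*(y) = psi_1(y) / sqrt(P_Y(y))
corr = np.einsum("xy,x,y->", P, f_star, g_star)   # E[f*(X) g*(Y)]
```

In this sketch, `sv_B[0]` should equal 1, `sv_B[1]` should match `rho`, and `corr` should reproduce `rho`, with `f_star` having zero mean and unit variance under $P_X$.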

In practical learning applications, since $X$ and $Y$ can have large alphabets or even be continuous, the matrix $\tilde{\mathbf{B}}$ cannot be easily estimated from data samples for computing the maximal correlation functions. Instead, we can use the ACE algorithm [14] to iteratively compute the maximal correlation functions, which is mathematically equivalent to the power method on $\tilde{\mathbf{B}}$. In particular, given a sequence of $n$ training samples $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$, i.i.d. generated from the joint distribution $P_{XY}$, the ACE algorithm for estimating the $k$-dimensional maximal correlation functions of (2) can be summarized as Algorithm 1³, where the expectations are taken over the empirical distribution $\hat{P}_{XY}$, or the conditional distributions $\hat{P}_{Y|X}$ and $\hat{P}_{X|Y}$ from the training samples. In addition, $\mathbf{\Lambda}_{\hat{f}}$ and $\mathbf{\Lambda}_{\hat{g}}$ are the empirical covariance matrices defined as

$$\mathbf{\Lambda}_{\hat{f}}=\frac{1}{n}\sum_{i=1}^{n}\hat{f}(x_{i})\hat{f}^{\mathrm{T}}(x_{i}),\quad\mathbf{\Lambda}_{\hat{g}}=\frac{1}{n}\sum_{i=1}^{n}\hat{g}(y_{i})\hat{g}^{\mathrm{T}}(y_{i}).$$

³ There are also other variants of the ACE algorithm for computing $k$-dimensional maximal correlation functions using different numerical techniques for normalization; see, e.g., [8, Algorithm 3] and [18, Algorithm 1].
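Since Algorithm 1 is not reproduced in this text, the following is only a hedged sketch of one possible $k$-dimensional ACE iteration on an empirical joint distribution, not the paper's exact procedure. It assumes that each step takes a conditional expectation, centers the result, and whitens it with the inverse square root of the empirical covariance; the function names `whiten` and `ace`, the random initialization, and the iteration count are all illustrative assumptions.

```python
import numpy as np

def whiten(h, p):
    """Center h (rows indexed by symbols) under pmf p, then make its covariance identity."""
    h = h - p @ h                               # subtract the mean E[h]
    cov = h.T @ (p[:, None] * h)                # k x k covariance matrix
    w, Q = np.linalg.eigh(cov)
    return h @ (Q / np.sqrt(w)) @ Q.T           # h * cov^{-1/2}

def ace(P, k, n_iter=2000, seed=0):
    """Sketch of k-dimensional ACE on a joint pmf P[x, y] (e.g. an empirical one)."""
    Px, Py = P.sum(axis=1), P.sum(axis=0)
    P_y_given_x = P / Px[:, None]               # row x holds P(y | x)
    P_x_given_y = (P / Py[None, :]).T           # row y holds P(x | y)
    rng = np.random.default_rng(seed)
    g = whiten(rng.standard_normal((P.shape[1], k)), Py)
    for _ in range(n_iter):
        f = whiten(P_y_given_x @ g, Px)         # f(x) <- E[g(Y) | X = x], normalized
        g = whiten(P_x_given_y @ f, Py)         # g(y) <- E[f(X) | Y = y], normalized
    return f, g

rng = np.random.default_rng(1)
P_hat = rng.random((5, 4))
P_hat /= P_hat.sum()                            # stand-in for an empirical pmf (assumption)
f_hat, g_hat = ace(P_hat, k=2)
corr_k = np.sum(P_hat * (f_hat @ g_hat.T))      # E[f^T(X) g(Y)] under P_hat
```

Under these assumptions the iteration acts as a subspace (power) iteration, so `corr_k` should approach the sum of the two largest singular values of the empirical CDM when those are separated from the third.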

Now, let us define the $|{\mathcal{X}}|$ and $|{\mathcal{Y}}|$ dimensional vectors $\hat{\bm{\phi}}_{i}$ and $\hat{\bm{\psi}}_{i}$ , respectively, for $i=1,\ldots,k$ as

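$$\hat{\phi}_{i}(x)=\sqrt{\hat{P}_{X}(x)}\,\hat{f}_{i}(x),\quad\hat{\psi}_{i}(y)=\sqrt{\hat{P}_{Y}(y)}\,\hat{g}_{i}(y),$$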

where $\hat{P}_{X}$ and $\hat{P}_{Y}$ are the empirical marginal distributions, and $\hat{f}_{i}$ and $\hat{g}_{i}$ are the $i$ -th dimension of $\hat{f}$ and $\hat{g}$ , i.e.,

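$$\hat{f}(x)=\left[\hat{f}_{1}(x),\dots,\hat{f}_{k}(x)\right]^{\mathrm{T}},\quad\hat{g}(y)=\left[\hat{g}_{1}(y),\dots,\hat{g}_{k}(y)\right]^{\mathrm{T}},$$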

for all $x,y$ . In addition, we define $\hat{\mathbf{B}}\in\mathbb{R}^{|{\mathcal{Y}}|\times|{\mathcal{X}}|}$ as

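$$\hat{B}(y,x)=\frac{\hat{P}_{XY}(x,y)}{\sqrt{\hat{P}_{X}(x)\hat{P}_{Y}(y)}}-\sqrt{\hat{P}_{X}(x)}\sqrt{\hat{P}_{Y}(y)},$$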

which is the CDM of the training samples. Then, the key steps of Algorithm 1 that alternately compute conditional expectations (cf. lines 4–6) can be equivalently expressed as the alternating iterations [13, Eq. (10) and (12)]

[TABLE]

until $\left\|\hat{\mathbf{B}}\right\|_{\mathrm{F}}^{2}-\left\|\hat{\mathbf{B}}-\hat{{\mathbb{\Psi}}}_{k}\hat{{\mathbb{\Phi}}}_{k}^{\mathrm{T}}\right\|_{\mathrm{F}}^{2}$ stops increasing, where

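$$\hat{{\mathbb{\Phi}}}_{k}=\left[\hat{\bm{\phi}}_{1},\dots,\hat{\bm{\phi}}_{k}\right],\quad\hat{{\mathbb{\Psi}}}_{k}=\left[\hat{\bm{\psi}}_{1},\dots,\hat{\bm{\psi}}_{k}\right].$$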

Note that (6) in fact coincides with the alternating least squares algorithm [19] for solving the low-rank approximation problem

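$$\min_{\hat{{\mathbb{\Psi}}}_{k},\hat{{\mathbb{\Phi}}}_{k}}\left\|\hat{\mathbf{B}}-\hat{{\mathbb{\Psi}}}_{k}\hat{{\mathbb{\Phi}}}_{k}^{\mathrm{T}}\right\|_{\mathrm{F}}^{2}.$$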

Therefore, from the Eckart–Young–Mirsky theorem [20], Algorithm 1 essentially computes the singular vectors of $\hat{\mathbf{B}}$ with respect to the top $k$ singular values, with more detailed illustrations provided in Appendix A for completeness. In the following, we simply use $\hat{\bm{\phi}}_{i}$ to denote the $i$ -th right singular vector of $\hat{\mathbf{B}}$ , and $\hat{{\mathbb{\Phi}}}_{k}$ is the $|{\mathcal{X}}|\times k$ matrix formed by the top $k$ right singular vectors.
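The link between alternating least squares and the top-$k$ singular subspace can be illustrated with a short sketch. The update formulas below are the textbook least-squares updates for a rank-$k$ factorization and are an assumption here, since the exact iterations of the paper are not reproduced in this text; the function name `als_low_rank` and the synthetic test matrix with known singular values are likewise illustrative.

```python
import numpy as np

def als_low_rank(B, k, n_iter=300, seed=0):
    """Textbook alternating least squares for min ||B - Psi Phi^T||_F^2."""
    rng = np.random.default_rng(seed)
    Psi = rng.standard_normal((B.shape[0], k))
    for _ in range(n_iter):
        # Fix Psi, solve for Phi: Phi = B^T Psi (Psi^T Psi)^{-1}
        Phi = np.linalg.solve(Psi.T @ Psi, Psi.T @ B).T
        # Fix Phi, solve for Psi: Psi = B Phi (Phi^T Phi)^{-1}
        Psi = np.linalg.solve(Phi.T @ Phi, Phi.T @ B.T).T
    return Psi, Phi

# Synthetic matrix with known singular values 3, 2, 1, 0.5 (assumption for the demo)
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((5, 4)))
V, _ = np.linalg.qr(rng.standard_normal((6, 4)))
s = np.array([3.0, 2.0, 1.0, 0.5])
B = (U * s) @ V.T

Psi, Phi = als_low_rank(B, k=2)
# Energy captured by the rank-2 fit: ||B||_F^2 - ||B - Psi Phi^T||_F^2
captured = np.linalg.norm(B, "fro") ** 2 - np.linalg.norm(B - Psi @ Phi.T, "fro") ** 2
```

By the Eckart–Young–Mirsky theorem, the best rank-2 fit captures $\sigma_1^2+\sigma_2^2=13$ of the squared Frobenius norm, which is the value `captured` should approach in this example.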

As discussed above, the maximal correlation functions of (2) correspond to the top $k$ singular vectors [cf. (4)] of the matrix $\tilde{\mathbf{B}}$ as defined in (3). Therefore, if the empirical distribution $\hat{P}_{XY}$ coincides with the true distribution $P_{XY}$ , then the matrix $\hat{\mathbf{B}}$ satisfies $\hat{\mathbf{B}}=\tilde{\mathbf{B}}$ , and the ACE algorithm outputs the maximal correlation functions of (2) precisely. However, since the training samples are i.i.d. sampled from $P_{XY}$ , the empirical distribution might deviate from the true distribution, which leads to deviations between the true singular vectors and the ones computed from the ACE algorithm. In order to quantify this deviation, we define the $|{\mathcal{X}}|\times k$ matrix ${\mathbb{\Phi}}_{k}$ as

$${\mathbb{\Phi}}_{k}\triangleq\left[\bm{\phi}_{1},\dots,\bm{\phi}_{k}\right],$$

where $\bm{\phi}_{i}$ is the $i$ -th right singular vector of $\tilde{\mathbf{B}}$ . Then, we apply the difference of the Frobenius norm-squares $\left\|\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\right\|_{\mathrm{F}}^{2}-\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}$ as the measurement to quantify how $\hat{{\mathbb{\Phi}}}_{k}$ deviates from ${\mathbb{\Phi}}_{k}$ . It is worth emphasizing that this measurement represents the test error of learned singular vectors and can be more effective than directly computing the difference between ${\mathbb{\Phi}}_{k}$ and $\hat{{\mathbb{\Phi}}}_{k}$ . For example, consider the simplest case $k=1$ , and let $\sigma_{i}$ denote the $i$ -th singular value of $\tilde{\mathbf{B}}$ . Without loss of generality, we assume that the estimated $\hat{\bm{\phi}}_{1}$ satisfies $\langle\hat{\bm{\phi}}_{1},\bm{\phi}_{1}\rangle\geq 0$ , since if not we can use $-\hat{\bm{\phi}}_{1}$ as the estimated singular vector. Then, it is shown in Appendix B that when $\sigma_{1}>\sigma_{2}$ , we have

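$$\left\|\bm{\phi}_{1}-\hat{\bm{\phi}}_{1}\right\|^{2}\leq\frac{2}{\sigma_{1}^{2}-\sigma_{2}^{2}}\left(\left\|\tilde{\mathbf{B}}\bm{\phi}_{1}\right\|^{2}-\left\|\tilde{\mathbf{B}}\hat{\bm{\phi}}_{1}\right\|^{2}\right),$$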

where the Frobenius norm becomes $\ell_{2}$ -norm since $k=1$ . Therefore, a small error in $\|\tilde{\mathbf{B}}\bm{\phi}_{1}\|^{2}-\|\tilde{\mathbf{B}}\hat{\bm{\phi}}_{1}\|^{2}$ implies a small error measured in $\|\bm{\phi}_{1}-\hat{\bm{\phi}}_{1}\|$ . However, when $\sigma_{1}=\sigma_{2}$ , both $\bm{\phi}_{1}$ and $\bm{\phi}_{2}$ (and thus any linear combination of $\bm{\phi}_{1}$ and $\bm{\phi}_{2}$ ) correspond to the maximal correlation function. While the measurement $\|\tilde{\mathbf{B}}\bm{\phi}_{1}\|^{2}-\|\tilde{\mathbf{B}}\hat{\bm{\phi}}_{1}\|^{2}$ is able to indicate zero learning error for both optimal choices $\hat{\bm{\phi}}_{1}=\bm{\phi}_{1}$ and $\hat{\bm{\phi}}_{1}=\bm{\phi}_{2}$ , the measurement $\|\bm{\phi}_{1}-\hat{\bm{\phi}}_{1}\|$ would indicate a large error for the optimal estimation $\hat{\bm{\phi}}_{1}=\bm{\phi}_{2}$ .
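The $k=1$ relation between the two error measures can be checked on simulated data. The sketch below assumes a toy true distribution, draws $n$ i.i.d. samples, and compares $\|\bm{\phi}_{1}-\hat{\bm{\phi}}_{1}\|^{2}$ with the bound $\frac{2}{\sigma_{1}^{2}-\sigma_{2}^{2}}\bigl(\|\tilde{\mathbf{B}}\bm{\phi}_{1}\|^{2}-\|\tilde{\mathbf{B}}\hat{\bm{\phi}}_{1}\|^{2}\bigr)$; the helper `cdm`, the sample size, and the random seed are illustrative assumptions.

```python
import numpy as np

def cdm(P):
    """Canonical dependence matrix of a joint pmf P[x, y]; shape (|Y|, |X|)."""
    Px, Py = P.sum(axis=1), P.sum(axis=0)
    return P.T / np.sqrt(np.outer(Py, Px)) - np.outer(np.sqrt(Py), np.sqrt(Px))

rng = np.random.default_rng(0)
P = rng.random((4, 4))
P /= P.sum()                                   # toy "true" distribution (assumption)

n = 2000
counts = rng.multinomial(n, P.ravel()).reshape(P.shape)
P_hat = counts / n                             # empirical joint distribution

B_tilde = cdm(P)
_, s, Vt = np.linalg.svd(B_tilde)
_, _, Vt_hat = np.linalg.svd(cdm(P_hat))
phi1, phi1_hat = Vt[0], Vt_hat[0]
if phi1 @ phi1_hat < 0:                        # fix the sign so that <phi1_hat, phi1> >= 0
    phi1_hat = -phi1_hat

lhs = np.sum((phi1 - phi1_hat) ** 2)
gap = np.sum((B_tilde @ phi1) ** 2) - np.sum((B_tilde @ phi1_hat) ** 2)
rhs = 2.0 / (s[0] ** 2 - s[1] ** 2) * gap
```

Here `gap` is the learning error used throughout the paper, and `lhs <= rhs` is the inequality stated above; the bound becomes uninformative as $\sigma_{1}$ approaches $\sigma_{2}$, which is the situation the subsequent discussion addresses.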

In addition, this measurement can be interpreted as the performance of learning the matrix $\tilde{\mathbf{B}}$ using $\hat{{\mathbb{\Phi}}}_{k}$ by low-rank approximation, since

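$$\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}=\left\|\tilde{\mathbf{B}}\right\|_{\mathrm{F}}^{2}-\min_{\hat{{\mathbb{\Psi}}}_{k}}\left\|\tilde{\mathbf{B}}-\hat{{\mathbb{\Psi}}}_{k}\hat{{\mathbb{\Phi}}}_{k}^{\mathrm{T}}\right\|_{\mathrm{F}}^{2}.$$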

Moreover, it is shown in [13] that $\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}$ is related to the softmax regression loss, and is called H-score therein, which provides the operational meaning for our selected performance measurement.
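The low-rank interpretation of the H-score can be verified directly: for a fixed $\hat{{\mathbb{\Phi}}}_{k}$ with orthonormal columns, the least-squares optimal $\hat{{\mathbb{\Psi}}}_{k}$ is $\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}$, and the residual equals $\|\tilde{\mathbf{B}}\|_{\mathrm{F}}^{2}-\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\|_{\mathrm{F}}^{2}$. The sketch below uses a random matrix as a stand-in for a CDM and a random orthonormal `Phi_hat`, both of which are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B_tilde = rng.standard_normal((4, 6))                    # stand-in for a CDM (assumption)
Phi_hat, _ = np.linalg.qr(rng.standard_normal((6, 2)))   # orthonormal columns, k = 2

# Least-squares solution of min_Psi ||B_tilde - Psi Phi_hat^T||_F^2
Psi_ls = np.linalg.lstsq(Phi_hat, B_tilde.T, rcond=None)[0].T
Psi_closed = B_tilde @ Phi_hat                           # closed form for orthonormal Phi_hat

residual = np.linalg.norm(B_tilde - Psi_ls @ Phi_hat.T, "fro") ** 2
h_score = np.linalg.norm(B_tilde @ Phi_hat, "fro") ** 2
total = np.linalg.norm(B_tilde, "fro") ** 2
```

In this sketch `h_score` should equal `total - residual`, so maximizing the H-score over orthonormal $\hat{{\mathbb{\Phi}}}_{k}$ is the same as minimizing the rank-$k$ approximation error.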

Note that $\left\|\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\right\|_{\mathrm{F}}^{2}-\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}\geq 0$ for all $|{\mathcal{X}}|\times k$ matrices $\hat{{\mathbb{\Phi}}}_{k}$ satisfying $\hat{{\mathbb{\Phi}}}_{k}^{\mathrm{T}}\hat{{\mathbb{\Phi}}}_{k}=\mathbf{I}_{k}$, where $\mathbf{I}_{k}$ is the $k\times k$ identity matrix. In this paper, our goal is to characterize the sample complexity in learning maximal correlation functions for large datasets, i.e., the minimum number of samples required such that with high probability, the learning error $\left\|\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\right\|_{\mathrm{F}}^{2}-\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}$ is small [21]. To this end, we first consider the related error exponent $\mathrm{E}_{k}$ defined as⁴

$$\mathrm{E}_{k}\triangleq-\lim_{\epsilon\to 0^{+}}\frac{1}{\epsilon}\lim_{n\to\infty}\frac{1}{n}\log\mathbb{P}_{n}\left\{\left\|\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\right\|_{\mathrm{F}}^{2}-\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}>\epsilon\right\},\tag{10}$$

where the probability is measured over the i.i.d. sampling process from $P_{XY}$. In particular, the limit in $n$ in (10) reflects the asymptotic regime we are interested in, since in large datasets the number of i.i.d. samples $n$ can be sufficiently large. The limit in $\epsilon$ then arises naturally because, in this asymptotic regime, the empirical distribution converges to the true distribution with high probability, and thus the learning error is small with high probability; see, e.g., [22, Proposition 4.6] or [18, Proposition 47] for a more rigorous characterization.

⁴ Throughout, all logarithms are base $e$, i.e., natural.

In the remaining sections, we will show that these two limits facilitate the derivation of the analytical solution for the error exponent (10), and further use this exponent to provide the upper bounds for the sample complexity on large datasets, where the test error $\epsilon$ is small. In order to establish the analytical solution of (10), in the next section, we develop a mathematical framework for computing the learning error $\left\|\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\right\|_{\mathrm{F}}^{2}-\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}$ for the empirical distributions $\hat{P}_{XY}$ close to the true distribution $P_{XY}$ .

III The Matrix Perturbation Analyses

Suppose that $\mathbf{A}\in\mathbb{R}^{d\times d}$ is a symmetric matrix with the eigenvectors $\bm{v}_{1},\dots,\bm{v}_{d}$ and the eigenvalues $\lambda_{1}\geq\dots\geq\lambda_{d}$ . In addition, we denote

$$\mathbf{V}_{k}\triangleq\left[\bm{v}_{1},\dots,\bm{v}_{k}\right]\in\mathbb{R}^{d\times k}\tag{11}$$

as the matrix formed by the top $k$ eigenvectors of $\mathbf{A}$ . Then, it follows that

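$$\operatorname*{tr}\left\{\mathbf{V}_{k}^{\mathrm{T}}\mathbf{A}\mathbf{V}_{k}\right\}=\sum_{i=1}^{k}\lambda_{i},$$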

where $\operatorname*{tr}\{\cdot\}$ denotes the trace of the matrix. Now, suppose that $\mathbf{A}(\tau)$ is a family of symmetric matrices parametrized by $\tau$ with $\mathbf{A}(0)=\mathbf{A}$ , and is an analytic function of $\tau$ with the Taylor series expansion

$$\mathbf{A}(\tau)=\mathbf{A}+\tau\mathbf{A}^{\prime}+o(\tau),$$

where $\mathbf{A}^{\prime}=\mathbf{A}^{\prime}(0)$ is the first-order derivative of $\mathbf{A}(\tau)$ with respect to $\tau$ at $\tau=0$ . In addition, let $\mathbf{V}_{k}(\tau)\in\mathbb{R}^{d\times k}$ be the matrix formed by the top $k$ eigenvectors of $\mathbf{A}(\tau)$ defined similarly to (11). Then, when $\lambda_{k}>\lambda_{k+1}$ , the following lemma characterizes the second-order Taylor series expansion of $\operatorname*{tr}\left\{\mathbf{V}_{k}^{\mathrm{T}}(\tau)\mathbf{A}\mathbf{V}_{k}(\tau)\right\}$ with respect to $\tau$ .

Lemma 1.

Suppose that $\lambda_{k}>\lambda_{k+1}$ , then

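$$\operatorname*{tr}\left\{\mathbf{V}_{k}^{\mathrm{T}}(\tau)\mathbf{A}\mathbf{V}_{k}(\tau)\right\}=\operatorname*{tr}\left\{\mathbf{V}_{k}^{\mathrm{T}}\mathbf{A}\mathbf{V}_{k}\right\}-\tau^{2}\sum_{i=1}^{k}\sum_{j=k+1}^{d}\frac{\left(\bm{v}_{i}^{\mathrm{T}}\mathbf{A}^{\prime}\bm{v}_{j}\right)^{2}}{\lambda_{i}-\lambda_{j}}+o(\tau^{2}),$$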

where $\operatorname*{tr}\{\cdot\}$ denotes the trace of the matrix.

Proof.

See Appendix C. ∎
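Lemma 1 lends itself to a quick numerical sanity check. The sketch below assumes the simplest analytic family $\mathbf{A}(\tau)=\mathbf{A}+\tau\mathbf{A}^{\prime}$, a symmetric $\mathbf{A}$ with well-separated eigenvalues, and a random symmetric $\mathbf{A}^{\prime}$; these choices, the value of `tau`, and the helper name `top_k_trace` are illustrative assumptions rather than anything prescribed by the paper.

```python
import numpy as np

def top_k_trace(A, A_tau, k):
    """tr{V_k(tau)^T A V_k(tau)}, with V_k(tau) the top-k eigenvectors of A_tau."""
    _, V = np.linalg.eigh(A_tau)            # eigenvalues in ascending order
    Vk = V[:, ::-1][:, :k]                  # reorder to descending, keep top k
    return np.trace(Vk.T @ A @ Vk)

rng = np.random.default_rng(0)
d, k = 6, 2
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))     # eigenvectors v_1, ..., v_d
lam = np.array([6.0, 5.0, 4.0, 3.0, 2.0, 1.0])       # lambda_1 > ... > lambda_d
A = (Q * lam) @ Q.T
M = rng.standard_normal((d, d))
A_prime = (M + M.T) / 2                               # symmetric perturbation direction

# Second-order coefficient from Lemma 1
second_order = sum(
    (Q[:, i] @ A_prime @ Q[:, j]) ** 2 / (lam[i] - lam[j])
    for i in range(k) for j in range(k, d)
)

tau = 1e-4
exact = top_k_trace(A, A + tau * A_prime, k)
predicted = lam[:k].sum() - tau ** 2 * second_order
```

For small `tau`, the difference between `exact` and `predicted` should be of higher order than `tau**2`, and `exact` should not exceed `lam[:k].sum()`.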

Moreover, for the case $\lambda_{k}=\lambda_{k+1}$ , we apply the notation $[d]\triangleq\{1,\dots,d\}$ , and define the indices set $\mathcal{I}_{k}\triangleq\{i\in[d]\colon\lambda_{i}=\lambda_{k}\}$ , and the complement set $\mathcal{I}^{\mathsf{c}}_{k}\triangleq[d]\setminus\mathcal{I}_{k}=\{i\in[d]\colon\lambda_{i}\neq\lambda_{k}\}$ . Furthermore, we define the matrix $\mathbf{V}_{\mathcal{I}_{k}}\triangleq[\bm{v}_{i},i\in\mathcal{I}_{k}]\in\mathbb{R}^{d\times|\mathcal{I}_{k}|}$ as the submatrix of $\mathbf{V}$ composed of the columns of $\mathbf{V}$ with indices in $\mathcal{I}_{k}$ . Then, the following lemma establishes the second-order Taylor series expansion of $\operatorname*{tr}\left\{\mathbf{V}_{k}^{\mathrm{T}}(\tau)\mathbf{A}\mathbf{V}_{k}(\tau)\right\}$ for the case $\lambda_{k}=\lambda_{k+1}$ .

Lemma 2.

Suppose that $\lambda_{k}=\lambda_{k+1}$ , then

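$$\operatorname*{tr}\left\{\mathbf{V}_{k}^{\mathrm{T}}(\tau)\mathbf{A}\mathbf{V}_{k}(\tau)\right\}=\operatorname*{tr}\left\{\mathbf{V}_{k}^{\mathrm{T}}\mathbf{A}\mathbf{V}_{k}\right\}-\tau^{2}\sum_{i=1}^{l-1}\sum_{j=l}^{d}\frac{\left(\bm{v}_{i}^{\mathrm{T}}\mathbf{A}^{\prime}\bm{v}_{j}\right)^{2}}{\lambda_{i}-\lambda_{j}}-\tau^{2}\sum_{i=l}^{k}\sum_{j\in\mathcal{I}^{\mathsf{c}}_{k}}\frac{\left(\hat{\bm{v}}_{i}^{\mathrm{T}}\mathbf{A}^{\prime}\bm{v}_{j}\right)^{2}}{\lambda_{k}-\lambda_{j}}+o(\tau^{2}),$$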

where $l$ is the minimal element of $\mathcal{I}_{k}$ , and

$$\hat{\bm{v}}_{i}\triangleq\mathbf{V}_{\mathcal{I}_{k}}\bm{u}_{i-l+1},\quad l\leq i\leq k,$$

where $\bm{u}_{1},\dots,\bm{u}_{k-l+1}\in\mathbb{R}^{|\mathcal{I}_{k}|}$ are the top $k-l+1$ eigenvectors of $\mathbf{V}_{\mathcal{I}_{k}}^{\mathrm{T}}\mathbf{A}^{\prime}\mathbf{V}_{\mathcal{I}_{k}}$ .

Proof.

See Appendix D. ∎

Note that since the Frobenius norm $\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}=\operatorname*{tr}\left\{\hat{{\mathbb{\Phi}}}_{k}^{\mathrm{T}}\tilde{\mathbf{B}}^{\mathrm{T}}\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\}$ , the results we developed in this section essentially characterize the difference between $\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|^{2}_{\mathrm{F}}$ and $\left\|\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\right\|^{2}_{\mathrm{F}}$ with respect to the perturbations on $\tilde{\mathbf{B}}$ due to the difference between $P_{XY}$ and $\hat{P}_{XY}$ . These results will be useful when we derive the error exponent (10) in the rest of this paper.

IV The Supervised Learning

Given $n$ training samples $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$ , i.i.d. generated from the joint distribution $P_{XY}$ , in this section we develop the error exponent (10) and an upper bound for sample complexity for large datasets, in both cases $\sigma_{k}>\sigma_{k+1}$ and $\sigma_{k}=\sigma_{k+1}$ , where $\sigma_{k}$ and $\sigma_{k+1}$ are the $k$ -th and $(k+1)$ -th largest singular values of $\tilde{\mathbf{B}}$ .

IV-A The Sample Complexity for the Case $\sigma_{k}>\sigma_{k+1}$

To delineate our results, we need the following definitions. First, we define the quantity $\alpha_{k}$ for the given $P_{XY}$ , which will be useful in characterizing the error exponent (10).

Definition 1.

Given a joint distribution $P_{XY}$ and $k\in\mathbb{N}^{+}$ , we define the matrix $\mathbf{G}_{k}$ as

[TABLE]

where $d\triangleq|{\mathcal{X}}|$ , and $\sigma_{i}$ denotes the $i$ -th singular value of $\tilde{\mathbf{B}}$ (we define $\sigma_{i}=0$ for $i>|{\mathcal{Y}}|$ if $|{\mathcal{X}}|>|{\mathcal{Y}}|$ ). In addition, $\mathbf{L}$ is an $(|{\mathcal{X}}|\cdot|{\mathcal{Y}}|)\times(|{\mathcal{X}}|\cdot|{\mathcal{Y}}|)$ matrix, whose entry at the $[(x-1)|{\mathcal{Y}}|+y]$ -th row and $[(x^{\prime}-1)|{\mathcal{Y}}|+y^{\prime}]$ -th column is defined as

[TABLE]

where $\delta_{ij}$ is the Kronecker delta, and

[TABLE]

where “ $\otimes$ ” denotes the Kronecker product, and $\bm{\phi}_{i}$ represents the $i$ -th right singular vector of $\tilde{\mathbf{B}}$ . Then, $\alpha_{k}$ is defined as the spectral norm of $\mathbf{G}_{k}$ .

Then, the error exponent $\mathrm{E}_{k}$ as defined in (10) can be established as follows, whose proof will later be provided.

Theorem 1.

If $\sigma_{k}>\sigma_{k+1}$ , then the error exponent $\mathrm{E}_{k}$ as defined in (10) is

[TABLE]

Then, the following result illustrates that, for large datasets where the learning error is typically small, an upper bound of sample complexity can be established using the error exponent $\mathrm{E}_{k}$ .

Theorem 2.

For given $P_{XY}$ , there exists a positive constant $\epsilon_{0}$ that depends only on $P_{XY}$ , such that for all $\epsilon\in(0,\epsilon_{0})$ and $\delta\in(0,1)$ , we have

[TABLE]

for all $n>N^{(4\alpha_{k})}(\epsilon,\delta)$ , where we have defined

[TABLE]

and where $\alpha_{k}$ is as defined in Definition 1.

Proof.

See Appendix G. ∎

From Theorem 2, to guarantee that the error in learning maximal correlation functions is within some small $\epsilon$ with probability at least $1-\delta$ , it suffices to use $n=O\left(\frac{\alpha_{k}}{\epsilon}\log\frac{\alpha_{k}}{\epsilon\delta}\right)=O\left(\frac{1}{\epsilon\mathrm{E}_{k}}\log\frac{1}{\delta\epsilon\mathrm{E}_{k}}\right)$ samples.

Remark 1.

For comparison, an upper bound of sample complexity $n=O(\frac{1}{\epsilon^{2}}\log\frac{1}{\delta})$ was provided in [22, Proposition 4.6] and [18, Proposition 47]. In particular, this upper bound is obtained by analyzing the concentration properties of $\hat{\mathbf{B}}$ , under the assumption that the true marginal distributions $P_{X}$ and $P_{Y}$ are known in advance. Our analysis does not rely on such assumptions, and the resulting upper bound of sample complexity for large datasets can be significantly tighter.

When we are interested in learning the entire correlation structure between $X$ and $Y$ , i.e., $k=d-1$ , the following proposition establishes a simple closed-form expression for the error exponent.

Proposition 1.

If $d=|{\mathcal{X}}|\leq|{\mathcal{Y}}|$ , $\sigma_{d-1}>0$ , and $k=d-1$ , then we have $\alpha_{k}=\frac{\sigma_{1}^{2}}{4}$ and

[TABLE]

Proof.

See Appendix H. ∎

We now present the proof of Theorem 1, which makes use of the perturbation analyses established in Section III. To begin, we define the following sets of empirical distributions.

Definition 2.

For all $\epsilon>0$ , the set $\mathcal{S}_{1}(\epsilon)$ is defined as

[TABLE]

where $\hat{{\mathbb{\Phi}}}_{k}$ corresponds to the top $k$ right singular vectors of $\hat{\mathbf{B}}$ , as defined in (5) and (7). Moreover, the set $\mathcal{N}(\epsilon)$ is defined as

[TABLE]

Furthermore, to characterize the empirical distributions $\hat{P}_{XY}\in\mathcal{N}(\epsilon)$ , we denote the difference between $\hat{P}_{XY}$ and the true distribution $P_{XY}$ as

[TABLE]

which induces a one-to-one correspondence $\hat{P}_{XY}\leftrightarrow\Gamma$ . (Note that since $D(\hat{P}_{XY}\|P_{XY})$ is finite, we have $\hat{P}_{XY}(x,y)=0$ for each $(x,y)$ with $P_{XY}(x,y)=0$ . Therefore, we can obtain the distribution $\hat{P}_{XY}$ from $\Gamma$ , via

$\displaystyle\hat{P}_{XY}(x,y)=P_{XY}(x,y)+\sqrt{\epsilon P_{XY}(x,y)}\,\Gamma(y,x),\text{ for all }(x,y)\in{\mathcal{X}}\times{\mathcal{Y}}.$ )

We also define the $|{\mathcal{Y}}|\times|{\mathcal{X}}|$ matrices $\bm{\Gamma}$ and $\mathbf{\Xi}$ , with entries at the $y$ -th row and $x$ -th column being $\Gamma(y,x)$ and

[TABLE]

respectively. Then, using Lemma 1, the matrix $\hat{\mathbf{B}}$ estimated from data samples with the empirical distribution in $\mathcal{N}(\epsilon)$ can be represented in a perturbation form illustrated as follows.

Lemma 3.

For given $P_{XY}$ , there exists a constant $C>0$ , such that for all $\epsilon>0$ and $\hat{P}_{XY}\in\mathcal{N}(\epsilon)$ , we have $\left\|\bm{\Xi}\right\|_{\mathrm{F}}\leq C$ and

[TABLE]

Proof.

See Appendix E. ∎

Moreover, the following lemma characterizes the I-projection of $P_{XY}$ onto the set $\mathcal{S}_{1}(\epsilon)\cap\mathcal{N}(\epsilon)$ , which will be useful for characterizing the error exponent.

Lemma 4.

For $\mathcal{S}_{1}(\epsilon)$ and $\mathcal{N}(\epsilon)$ as defined in Definition 2, we have

[TABLE]

Here, given a distribution $P$ and a set of distributions $\mathcal{S}$ , we adopt the notation [23, p. 431]

$\displaystyle D(\mathcal{S}\|P)\triangleq\inf_{Q\in\mathcal{S}}D(Q\|P).$

Proof.

See Appendix F. ∎

Based on the above lemmas, Theorem 1 can be established as follows.

Proof of Theorem 1.

First, it follows from Sanov’s theorem [23] that

[TABLE]

where “ $\doteq$ ” is the conventional dot-equal notation; in particular, we use $a_{n}\doteq\exp(nb)$ to denote

$\displaystyle\lim_{n\to\infty}\frac{1}{n}\log a_{n}=b.$

Therefore, the error exponent (16) can be expressed as

[TABLE]

From Lemma 4, there exists an $\epsilon_{0}>0$ such that, for all $\epsilon\in(0,\epsilon_{0})$ ,

[TABLE]

In addition, note that from (20), for all $\hat{P}_{XY}\in\mathcal{S}_{1}(\epsilon)\setminus\mathcal{N}(\epsilon)$ we have $D\bigl{(}\hat{P}_{XY}\big{\|}P_{XY}\bigr{)}>\frac{\epsilon}{\alpha_{k}}$ . Hence, for all $\epsilon\in(0,\epsilon_{0})$ we have

[TABLE]

which implies that

[TABLE]

Combining (26) and (27), we obtain (16). ∎

IV-B The Sample Complexity for the Case $\sigma_{k}=\sigma_{k+1}$

The idea of deriving the sample complexity in this case is similar to the case $\sigma_{k}>\sigma_{k+1}$ . To delineate the result, we first define

[TABLE]

Similar to $\mathbf{G}_{k}$ and $\alpha_{k}$ defined in Section IV-A, for the case $\sigma_{k}=\sigma_{k+1}$ we define the matrix $\mathbf{J}_{k}$ and $\beta_{k}$ to characterize the error exponent $\mathrm{E}_{k}$ .

Definition 3.

Given $\mathbf{\Gamma}\in\mathbb{R}^{|{\mathcal{Y}}|\times|{\mathcal{X}}|}$ , the matrix $\mathbf{J}_{k}(\mathbf{\Gamma})\in\mathbb{R}^{(|{\mathcal{X}}||{\mathcal{Y}}|)\times(|{\mathcal{X}}||{\mathcal{Y}}|)}$ is defined as

[TABLE]

where $\mathbf{G}_{l-1}$ and $\mathbf{L}$ are as defined in (13)–(14), $\mathcal{I}^{\mathsf{c}}_{k}\triangleq[d]\setminus\mathcal{I}_{k}$ , and, for all $i,j$ , ${\bm{\vartheta}}_{ij}\triangleq\bm{\phi}_{j}\otimes\bigl{(}\tilde{\mathbf{B}}\bm{\varphi}_{i}\bigr{)}+\bm{\varphi}_{i}\otimes\bigl{(}\tilde{\mathbf{B}}\bm{\phi}_{j}\bigr{)}$ , with $\bm{\varphi}_{i}$ defined as

[TABLE]

where $\bm{u}_{1},\dots,\bm{u}_{k-l+1}\in\mathbb{R}^{|\mathcal{I}_{k}|}$ are the top $k-l+1$ eigenvectors of the matrix ${\mathbb{\Phi}}_{\mathcal{I}_{k}}^{\mathrm{T}}\left(\tilde{\mathbf{B}}^{\mathrm{T}}\bm{\Xi}+\bm{\Xi}^{\mathrm{T}}\tilde{\mathbf{B}}\right){\mathbb{\Phi}}_{\mathcal{I}_{k}}$ , and $\bm{\Xi}$ is as defined in (22). In addition, let $\operatorname{vec}(\cdot)$ denote the vectorization operation that stacks all columns of a matrix into a vector; specifically, for $\mathbf{W}=[w_{ij}]\in\mathbb{R}^{p\times q}$ , $\operatorname{vec}(\mathbf{W})$ is a $pq$ -dimensional column vector with the $[p(j-1)+i]$ -th entry being $w_{ij}$ . Then, $\beta_{k}$ is defined as the optimal value of the optimization problem

[TABLE]
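The column-stacking convention of $\operatorname{vec}(\cdot)$ corresponds to column-major flattening. The following minimal sketch illustrates the index rule that $w_{ij}$ lands at position $p(j-1)+i$ ; the helper name `vec` and the example matrix are illustrative assumptions.

```python
import numpy as np

def vec(W):
    """Stack the columns of W into a single vector (column-major order)."""
    return np.asarray(W).flatten(order="F")

W = np.array([[1, 2, 3],
              [4, 5, 6]])          # p = 2 rows, q = 3 columns
p, q = W.shape
v = vec(W)                         # columns stacked: [1, 4, 2, 5, 3, 6]

i, j = 2, 3                        # 1-based indices, as in the text
entry = v[p * (j - 1) + i - 1]     # subtract 1 for Python's 0-based indexing
```

Note that NumPy's default flattening is row-major, so the `order="F"` argument is needed to match the convention used here.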

Then, the following theorem characterizes the error exponent for the case where $\sigma_{k}$ and $\sigma_{k+1}$ are equal, and the corresponding upper bound of sample complexity can be established similarly to Theorem 2.

Theorem 3.

If $\sigma_{k}=\sigma_{k+1}$ , the error exponent $\mathrm{E}_{k}$ as defined in (10) is

[TABLE]

Proof.

See Appendix I. ∎

Note that $\mathbf{J}_{k}(\bm{\Gamma})$ in (29) depends on $\mathbf{\Gamma}$ , since ${\bm{\vartheta}}_{ij}$ depends on $\bm{\Xi}$ . Therefore, unlike in Theorem 1, the optimal value of the optimization problem defining $\beta_{k}$ is not simply the largest singular value of some given matrix, and this problem is in general neither convex nor concave. However, if $\mathbf{J}_{k}$ is fixed, the problem reduces to computing the largest singular value of $\mathbf{J}_{k}$ . Therefore, we can first fix $\mathbf{\Gamma}$ to compute (or update) $\mathbf{J}_{k}$ , then compute the singular vector of $\mathbf{J}_{k}$ associated with its largest singular value to update $\mathbf{\Gamma}$ , and so forth. This iterative procedure, summarized in Algorithm 2, finds a local optimum of the problem. In particular, to update $\mathbf{\Gamma}$ (cf. lines 13–16 of Algorithm 2), we project $\operatorname{vec}\left(\mathbf{\Gamma}\right)$ onto the eigenspace of $\mathbf{J}_{k}$ associated with its largest singular value, where a learning rate $\eta$ is used to enhance the robustness of the update.

While the optimization problem defining $\beta_{k}$ has in general no closed-form solution, closed-form solutions exist for some special joint distributions.

Corollary 1.

Suppose $d=|{\mathcal{X}}|\leq|{\mathcal{Y}}|$ , and the joint distribution $P_{XY}(x,y)$ takes the form

[TABLE]

where the probabilities $p_{1}$ and $p_{2}$ satisfy $p_{1}\neq p_{2}$ and $d[p_{1}+(|{\mathcal{Y}}|-1)p_{2}]=1$ . Then for any dimension $k\in[d-1]$ , we have $\beta_{k}=\frac{\sigma_{1}^{2}}{4}$ and thus

[TABLE]

where

[TABLE]

are the non-zero singular values of the corresponding $\tilde{\mathbf{B}}$ .

Proof.

See Appendix J. ∎
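The degenerate spectrum in Corollary 1 can be inspected numerically. The sketch below rests on two assumptions that are not shown in this section: that the joint distribution places $p_{1}$ on the pairs with $x=y$ and $p_{2}$ elsewhere (which is consistent with the normalization $d[p_{1}+(|{\mathcal{Y}}|-1)p_{2}]=1$ ), and that $\tilde{\mathbf{B}}$ has entries $[P_{XY}(x,y)-P_{X}(x)P_{Y}(y)]/\sqrt{P_{X}(x)P_{Y}(y)}$ . The function names and the value $p_{2}=0.05$ are illustrative.

```python
import numpy as np

def corollary1_joint(d, ny, p2):
    """Assumed form: p1 where x == y, p2 elsewhere; rows index x, columns index y."""
    p1 = 1.0 / d - (ny - 1) * p2           # enforces d * [p1 + (ny - 1) * p2] = 1
    P = np.full((d, ny), p2)
    P[np.arange(d), np.arange(d)] = p1
    return P

def b_tilde(P):
    """Assumed definition of B-tilde; returns a |Y| x |X| matrix."""
    px, py = P.sum(axis=1), P.sum(axis=0)
    prod = np.outer(px, py)
    return ((P - prod) / np.sqrt(prod)).T

P = corollary1_joint(d=4, ny=4, p2=0.05)   # gives p1 = 0.10
sigma = np.linalg.svd(b_tilde(P), compute_uv=False)
# For this square instance, B-tilde = 0.2 * I - 0.05 * (all-ones matrix),
# so the first d - 1 singular values coincide and the last one vanishes.
```

In this instance the top three singular values are all equal, so every $k\in[d-1]$ falls into the case $\sigma_{k}=\sigma_{k+1}$ or sits at the boundary of the non-zero spectrum.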

IV-C Remarks on the General Trend of Error Exponent

In machine learning problems, it is also interesting to investigate $\bigl{\|}\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\bigr{\|}_{\mathrm{F}}^{2}/\bigl{\|}\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\bigr{\|}_{\mathrm{F}}^{2}$ , which tells how effective $\hat{{\mathbb{\Phi}}}_{k}$ is, compared to ${\mathbb{\Phi}}_{k}$ . In particular, this is studied by the asymptotic problem

[TABLE]

where $\sigma_{1},\ldots,\sigma_{k}$ are the top $k$ singular values of $\tilde{\mathbf{B}}$ .

The error exponent (34), combined with Theorem 2, offers insights on choosing the dimensionality $k$ to effectively extract the correlation structure among different data variables from a given set of training samples. In particular, since both $X$ and $Y$ are discrete, the true distribution $P_{XY}$ can be approximated by the empirical distribution $\hat{P}_{XY}$ of training samples, and then the normalized error exponent $\hat{\mathrm{E}}_{k}$ can be obtained by computing the corresponding $\alpha_{k}$ or $\beta_{k}$ from $\hat{P}_{XY}$ .

However, in real algorithm designs, it is more useful to provide a general trend of the error exponent over different $k$ . For certain symmetric joint distributions, it can be verified that the normalized error exponent $\hat{\mathrm{E}}_{k}$ is linear in $k$ . For example, consider the joint distribution $P_{XY}$ as constructed in Corollary 1; then we have $\hat{\mathrm{E}}_{k}=\left(\sum_{i=1}^{k}\sigma_{i}^{2}\right)\mathrm{E}_{k}=k\sigma_{1}^{2}\mathrm{E}_{k}=2k$ . However, one can easily construct examples in which the error exponent is not monotonic with respect to $k$ , and the behavior is generally complicated.

To gain more insights, we uniformly sample the joint distribution $P_{XY}$ from the distribution space of $X,Y$ (in particular, we generate independent random numbers uniformly from $[0,1]$ for each entry of $P_{XY}$ , and then normalize the sum of the entries to $1$ ), and consider the empirical average of the error exponents (34) over the sampled joint distributions. Fig. 1 plots the empirical average of the error exponents over $10^{5}$ sampled joint distributions with $|{\mathcal{X}}|=12$ and $|{\mathcal{Y}}|=10$ , from which we observe that the error exponent grows linearly in $k$ for small $k$ , and becomes super-linear when $k$ is large. Although we do not have a rigorous proof in this paper, this general trend of the normalized error exponent, combined with Theorem 2, may provide practical design guidance for selecting the dimensionality of the maximal correlation functions in real problems.
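The sampling procedure for the joint distributions can be sketched as follows. This is a minimal illustration of the entry-wise uniform draw followed by normalization; the function name `sample_joint` and the seed are assumptions made here for reproducibility, and the computation of the error exponent itself is not included.

```python
import numpy as np

def sample_joint(nx, ny, rng):
    """Draw each entry uniformly from [0, 1], then normalize the total mass to 1."""
    P = rng.uniform(0.0, 1.0, size=(nx, ny))
    return P / P.sum()

rng = np.random.default_rng(0)
P = sample_joint(12, 10, rng)      # |X| = 12, |Y| = 10, as in the experiment
P_X = P.sum(axis=1)                # marginal of X
P_Y = P.sum(axis=0)                # marginal of Y
```

Repeating the draw and averaging a per-distribution statistic gives an empirical average of the kind plotted in Fig. 1.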

V The Semi-supervised Learning

In the semi-supervised learning setup, in addition to the $n$ labeled samples $(x_{1},y_{1}),\dots,(x_{n},y_{n})$ , we also have $m=nr$ unlabeled samples $x_{n+1},\ldots,x_{n+m}$ , where $r$ is the ratio of the number of unlabeled samples to the number of labeled samples, and the unlabeled samples are assumed to be sampled i.i.d. from the marginal distribution $P_{X}$ and independent of the labeled samples. In order to estimate the maximal correlation functions from both the labeled and unlabeled samples, we denote the empirical distribution of the unlabeled samples $x_{n+1},\ldots,x_{n+m}$ as $Q_{X}$ , and again use $\hat{P}_{XY}$ and $\hat{P}_{Y|X}$ to denote the empirical joint and conditional distributions of the labeled samples. Moreover, we denote the empirical distribution for $x_{1},\ldots,x_{n+m}$ as $\bar{P}_{X}$ , which is the empirical marginal distribution of $X$ over all samples, and can be expressed as

[TABLE]

Then, Algorithm 1 can be generalized to estimating the maximal correlation functions from both labeled and unlabeled samples. For this purpose, we define

[TABLE]

as the empirical joint distribution by including both the labeled and unlabeled samples, with the corresponding marginal distributions being $\bar{P}_{X}$ and

[TABLE]

respectively. Similarly, we obtain the conditional distributions

[TABLE]

and

[TABLE]

Then, we can generalize Algorithm 1 to semi-supervised learning by replacing all expectation operations taken over the empirical distributions $\hat{P}_{X},\hat{P}_{Y},\hat{P}_{XY},\hat{P}_{X|Y},\hat{P}_{Y|X}$ with expectations taken over the empirical distributions $\bar{P}_{X},\bar{P}_{Y},\bar{P}_{XY},\bar{P}_{X|Y},\bar{P}_{Y|X}$ , respectively. In particular, let $\bar{f}\colon{\mathcal{X}}\mapsto\mathbb{R}^{k}$ and $\bar{g}\colon{\mathcal{Y}}\mapsto\mathbb{R}^{k}$ denote the initially chosen functions of $X$ and $Y$ , which are zero-mean over the distributions $\bar{P}_{X}$ and $\bar{P}_{Y}$ , respectively. Then, the alternating conditional expectation operations (cf. lines 13–16 of Algorithm 1) can be represented as:

[TABLE]

where $\mathbf{\Lambda}_{\bar{f}}$ and $\mathbf{\Lambda}_{\bar{g}}$ are the covariance matrices of $\bar{f}$ and $\bar{g}$ , defined as

[TABLE]

The idea of this generalization is that the unlabeled samples do not help the estimation of the conditional distribution $P_{Y|X}$ , but improve the estimation of the marginal distribution $P_{X}$ . Therefore, the first step in the algorithm remains the same, while in the second step, the improved empirical marginal distribution $\bar{P}_{X}$ is applied to improve the estimation of the conditional distribution $P_{X|Y}$ . In practice, we may assume that the marginal distribution is much easier to estimate than the joint distribution [8], and hence the above generalized ACE algorithm can still be implemented for computing the maximal correlation functions from training samples. In this section, our goal is to characterize the corresponding error exponent for the generalized ACE algorithm (38) in the semi-supervised learning scenario.
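The construction described above can be sketched in a few lines. The pooled marginal is simply the empirical distribution of $X$ over all $n+m$ samples. The joint distribution below is formed as the pooled marginal times the labeled-sample conditional, which is an assumption consistent with the idea stated above but not a transcription of (36). The function name, the integer encoding of symbols, and the requirement that every value of $x$ appears among the labeled samples are also assumptions of this sketch.

```python
import numpy as np

def pooled_distributions(labeled_x, labeled_y, unlabeled_x, nx, ny):
    """Combine labeled pairs and unlabeled x-samples (symbols coded 0..nx-1, 0..ny-1)."""
    labeled_x, labeled_y = np.asarray(labeled_x), np.asarray(labeled_y)
    unlabeled_x = np.asarray(unlabeled_x)
    n, m = len(labeled_x), len(unlabeled_x)   # requires n > 0 and m > 0

    counts = np.zeros((nx, ny))
    np.add.at(counts, (labeled_x, labeled_y), 1)
    P_hat_xy = counts / n                     # empirical joint of labeled samples
    P_hat_x = P_hat_xy.sum(axis=1)
    Q_x = np.bincount(unlabeled_x, minlength=nx) / m

    # Empirical marginal of X over all n + m samples.
    P_bar_x = (n * P_hat_x + m * Q_x) / (n + m)

    # Assumed form of the pooled joint: improved marginal times labeled conditional.
    P_hat_y_given_x = P_hat_xy / P_hat_x[:, None]
    P_bar_xy = P_bar_x[:, None] * P_hat_y_given_x
    return P_bar_x, P_bar_xy

P_bar_x, P_bar_xy = pooled_distributions(
    [0, 0, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1], nx=2, ny=2)
```

With $r=m/n$ , the pooled marginal can equivalently be written as $(\hat{P}_{X}+rQ_{X})/(1+r)$ , which makes explicit how the unlabeled samples enter only through the marginal of $X$ .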

To this end, let us define the $|{\mathcal{Y}}|\times|{\mathcal{X}}|$ matrix $\bar{\mathbf{B}}$ , with entries

[TABLE]

where $\bar{P}_{Y}$ is the marginal distribution of $\bar{P}_{XY}$ as defined in (37). Then, it is shown in Appendix K that the algorithm (38) essentially computes the top $k$ singular vectors of $\bar{\mathbf{B}}$ . In addition, we denote $\bar{{\mathbb{\Phi}}}_{k}$ as the $|{\mathcal{X}}|\times k$ dimensional matrix with the $i$ -th column being the $i$ -th right singular vector of $\bar{\mathbf{B}}$ . Then for large datasets, the sample complexity of estimating the maximal correlation functions by the algorithm (38) can be characterized by investigating the error exponent

[TABLE]

where the probability is measured over the $n$ i.i.d. samples from $P_{XY}$ and the $m=nr$ i.i.d. samples from $P_{X}$ . In the following, we develop the error exponent (40) for semi-supervised learning, for both cases $\sigma_{k}>\sigma_{k+1}$ and $\sigma_{k}=\sigma_{k+1}$ , where $\sigma_{k}$ and $\sigma_{k+1}$ are the $k$ -th and $(k+1)$ -th largest singular values of $\tilde{\mathbf{B}}$ .

V-A The Sample Complexity for the Case $\sigma_{k}>\sigma_{k+1}$

Similar to Section IV-A, we first define the matrix $\bar{\mathbf{G}}_{k}(r)$ and the quantity $\bar{\alpha}_{k}(r)$ , which will be useful in characterizing the exponent $\bar{\mathrm{E}}_{k}$ .

Definition 4.

For given $r\geq 0$ , the matrix $\bar{\mathbf{G}}_{k}(r)$ is defined as

[TABLE]

where $\mathbf{G}_{k}$ is as defined in (13), and $\bar{\mathbf{L}}(r)$ is an $(|{\mathcal{X}}|\cdot|{\mathcal{Y}}|)\times[|{\mathcal{X}}|(|{\mathcal{Y}}|+1)]$ matrix with its entry at the $[(x-1)|{\mathcal{Y}}|+y]$ -th row and $[(x^{\prime}-1)|{\mathcal{Y}}|+y^{\prime}]$ -th column defined as

[TABLE]

for all $x,x^{\prime}\in\{1,\dots,|{\mathcal{X}}|\}$ and $y,y^{\prime}\in\{1,\dots,|{\mathcal{Y}}|\}$ , where $\delta_{ij}$ denotes the Kronecker delta. Then, $\bar{\alpha}_{k}(r)$ is defined as the spectral norm of the matrix $\bar{\mathbf{G}}_{k}(r)$ .

Then, the following theorem establishes the analytical form of the error exponent $\bar{\mathrm{E}}_{k}$ , whose proof will be later provided.

Theorem 4.

If $\sigma_{k}>\sigma_{k+1}$ , the error exponent $\bar{\mathrm{E}}_{k}(r)$ as defined in (40) is

[TABLE]

Similar to the case of supervised learning, for semi-supervised learning we have the following result establishing the upper bound for the sample complexity of learning maximal correlation functions on large datasets.

Theorem 5.

For given $P_{XY}$ and $r>0$ , there exists a positive constant $\bar{\epsilon}_{0}$ that depends only on $P_{XY}$ and $r$ , such that for all $\epsilon\in(0,\bar{\epsilon}_{0})$ and $\delta\in(0,1)$ , we have

[TABLE]

for all $n>\bar{N}^{(4\bar{\alpha}_{k}(r))}(\epsilon,\delta,r)$ , where we have defined

[TABLE]

and where $\bar{\alpha}_{k}(r)$ is as defined in Definition 4.

Proof.

See Appendix O. ∎

Furthermore, the performance gain of estimating the maximal correlation functions with the aid of the unlabeled samples can be characterized by the following proposition.

Proposition 2.

For all $r\geq 0$ , the $\bar{\alpha}_{k}(r)$ as defined in Definition 4 is a non-increasing and convex function of $r$ , and satisfies

[TABLE]

where $\bar{\alpha}_{k}(\infty)$ is defined as $\bar{\alpha}_{k}(\infty)\triangleq\lim_{r\to+\infty}\bar{\alpha}_{k}(r)$ (the limit exists since $\bar{\alpha}_{k}(r)$ is non-increasing and bounded below), and we have $\bar{\alpha}_{k}(0)=\alpha_{k}$ with $\alpha_{k}$ as defined in Definition 1.

Proof.

See Appendix P. ∎

From Proposition 2, the error exponent $\bar{\mathrm{E}}_{k}(r)=[2\bar{\alpha}_{k}(r)]^{-1}$ of semi-supervised learning is a non-decreasing function of $r$ . Thus, with more unlabeled data samples used to train maximal correlation functions, we can obtain better performance. Moreover, it follows immediately from the first inequality of (46) that

[TABLE]

where $(1+r)\mathrm{E}_{k}$ can be interpreted as the error exponent in the case where we replace all $nr$ unlabeled data samples with labeled data samples and obtain $n(1+r)$ labeled samples of $(X,Y)$ . Therefore, the upper bound (47) simply implies that the labeled data is generally more useful in estimating the maximal correlation functions. However, this upper bound is achievable for certain cases, where the unlabeled data can be as useful as the labeled data, as illustrated in the following proposition (cf. Proposition 1).

Proposition 3.

If $d=|{\mathcal{X}}|\leq|{\mathcal{Y}}|$ , $\sigma_{d-1}>0$ , and $k=d-1$ , then we have $\bar{\alpha}_{k}(r)=\frac{\alpha_{k}}{1+r}=\frac{\sigma_{1}^{2}}{4(1+r)}$ , and thus

[TABLE]

Proof.

See Appendix Q. ∎

In Proposition 1 and Proposition 3, we are interested in learning the entire correlation structure between $X$ and $Y$ , i.e., $k=d-1$ . In such cases, learning top $d-1$ singular vectors ${\mathbb{\Phi}}_{k}=[\bm{\phi}_{1},\dots,\bm{\phi}_{d-1}]$ is equivalent to learning the last singular vector

[TABLE]

which depends only on the marginal distribution $P_{X}$ . Hence, the unlabeled data samples of $X$ are as useful as the labeled data samples of $(X,Y)$ , and thus we can achieve the upper bound of (47).

We now present the proof of Theorem 4, which again makes use of the perturbation analyses established in Section III. To start, we define the sets of the joint distributions $\bar{P}_{XY}$ as follows.

Definition 5.

For all $\epsilon>0$ , the set $\bar{\mathcal{S}}_{1}(\epsilon)$ is defined as

[TABLE]

where $\bar{{\mathbb{\Phi}}}_{k}$ corresponds to the top $k$ right singular vectors of $\bar{\mathbf{B}}$ as defined in (39). Moreover, the set $\bar{\mathcal{N}}(\epsilon)$ is defined as

[TABLE]

where for given $\hat{P}_{XY}$ and $Q_{X}$ , the joint distribution $\bar{P}_{XY}$ is as defined in (36).

Furthermore, for each $\bar{P}_{XY}\in\bar{\mathcal{N}}(\epsilon)$ with the corresponding empirical distributions $\hat{P}_{XY}$ and $Q_{X}$ , we introduce the one-to-one correspondences $\hat{P}_{XY}\leftrightarrow\Gamma$ and $Q_{X}\leftrightarrow\zeta$ , where $\Gamma(y,x)$ is as defined in (21), and where, similarly, we have defined

[TABLE]

Moreover, we define $\bm{\zeta}$ as the $|{\mathcal{X}}|$ -dimensional vector with the $x$ -th entry being $\zeta(x)$ , and define the $|{\mathcal{Y}}|\times|{\mathcal{X}}|$ matrix $\bar{\bm{\Xi}}$ with the entries $\bar{\Xi}(y,x)$ being

[TABLE]

where we have defined

[TABLE]

Then, similar to Lemma 3, the matrix $\bar{\mathbf{B}}$ estimated from data samples can also be represented in a perturbation form, as stated in the following lemma.

Lemma 5.

For given $P_{XY}$ and $r\geq 0$ , there exists a constant $\bar{C}>0$ , such that for all $\epsilon>0$ and $\bar{P}_{XY}\in\bar{\mathcal{N}}(\epsilon)$ , we have $\left\|\bar{\bm{\Xi}}\right\|_{\mathrm{F}}\leq\bar{C}$ and

[TABLE]

Proof.

See Appendix L. ∎

In addition, the following lemma characterizing the error exponent will be useful in our analysis, and can be obtained using Sanov’s theorem.

Lemma 6.

Given $\epsilon>0$ , we have

[TABLE]

where $\bar{\mathcal{S}}_{1}(\epsilon)$ is as defined in (48).

Proof.

See Appendix M. ∎

From (54), for given $\epsilon>0$ , the error exponent is determined by the infimum of a weighted sum of K-L divergences. Furthermore, if we restrict our attention to the distributions $\bar{P}_{XY}\in\bar{\mathcal{N}}(\epsilon)$ , the following result provides a characterization of this infimum in the small $\epsilon$ regime, and will also be useful in our analysis.

Lemma 7.

For $\bar{\mathcal{S}}_{1}(\epsilon)$ and $\bar{\mathcal{N}}(\epsilon)$ as defined in Definition 5, we have

[TABLE]

Proof.

See Appendix N. ∎

Using Lemma 6 and Lemma 7, Theorem 4 can be established as follows.

Proof of Theorem 4.

From Lemma 6, the error exponent $\bar{\mathrm{E}}_{k}(r)$ can be expressed as

[TABLE]

Moreover, from Lemma 7, there exists an $\epsilon_{0}>0$ such that for all $\epsilon\in(0,\epsilon_{0})$ , we have

[TABLE]

In addition, note that for all $\bar{P}_{XY}\in\bar{\mathcal{S}}_{1}(\epsilon)\setminus\bar{\mathcal{N}}(\epsilon)$ we have $\left[D\bigl{(}\hat{P}_{XY}\big{\|}P_{XY}\bigr{)}+rD(Q_{X}\|P_{X})\right]>\frac{\epsilon}{\bar{\alpha}_{k}(r)}$ . Hence, for all $\epsilon\in(0,\epsilon_{0})$ we have

[TABLE]

which implies that

[TABLE]

Combining (56) and (57), we obtain (43). ∎

V-B The Sample Complexity for the Case $\sigma_{k}=\sigma_{k+1}$

With ${\mathcal{I}_{k}}$ , $l$ , and ${\mathbb{\Phi}}_{\mathcal{I}_{k}}$ as defined in (28), we further introduce the quantity $\bar{\beta}_{k}(r)$ as follows.

Definition 6.

Given $r\geq 0$ , $\bm{\Gamma}\in\mathbb{R}^{|{\mathcal{Y}}|\times|{\mathcal{X}}|}$ , and $\bm{\zeta}\in\mathbb{R}^{|{\mathcal{X}}|}$ , the matrix $\bar{\mathbf{J}}_{k}(r,\bm{\Gamma},\bm{\zeta})$ is defined as

[TABLE]

where $\bar{\mathbf{G}}_{l-1}$ and $\bar{\mathbf{L}}(r)$ are as defined in (41)–(42), $\mathbf{L}$ is as defined in (14), and, for all $i,j$ , $\bar{{\bm{\vartheta}}}_{ij}\triangleq\bm{\phi}_{j}\otimes\bigl{(}\tilde{\mathbf{B}}\bar{\bm{\varphi}}_{i}\bigr{)}+\bar{\bm{\varphi}}_{i}\otimes\bigl{(}\tilde{\mathbf{B}}\bm{\phi}_{j}\bigr{)}$ , with $\bar{\bm{\varphi}}_{i}$ defined as

[TABLE]

and where $\bar{\bm{u}}_{1},\dots,\bar{\bm{u}}_{k-l+1}\in\mathbb{R}^{|\mathcal{I}_{k}|}$ are the top $k-l+1$ eigenvectors of the matrix ${\mathbb{\Phi}}_{\mathcal{I}_{k}}^{\mathrm{T}}\left(\tilde{\mathbf{B}}^{\mathrm{T}}\bar{\bm{\Xi}}+\bar{\bm{\Xi}}^{\mathrm{T}}\tilde{\mathbf{B}}\right){\mathbb{\Phi}}_{\mathcal{I}_{k}}$ . Then, $\bar{\beta}_{k}(r)$ is defined as the optimal value of the optimization problem

[TABLE]

where $\bm{\varsigma}\in\mathbb{R}^{|{\mathcal{X}}|(|{\mathcal{Y}}|+1)}$ is defined as

[TABLE]

Then we have the following result characterizing the error exponent (40), and the corresponding upper bound of sample complexity for large datasets can be established similar to Theorem 5.

Theorem 6.

If $\sigma_{k}=\sigma_{k+1}$ , the error exponent $\bar{\mathrm{E}}_{k}(r)$ as defined in (40) is

[TABLE]

Proof.

See Appendix R. ∎

Note that $\bar{\mathbf{J}}_{k}$ in (58) depends on $\bm{\varsigma}$ , since $\bar{{\bm{\vartheta}}}_{ij}$ depends on $\bar{\bm{\Xi}}$ . Therefore, unlike in Theorem 4, the optimal value of (60) is not simply the largest singular value of some given matrix. However, if we fix $\bar{\mathbf{J}}_{k}$ , the optimization problem (60) reduces to computing the largest singular value of $\bar{\mathbf{J}}_{k}$ . As a result, similar to the approach introduced in Section IV-B, we can alternately update $\bm{\varsigma}$ and $\bar{\mathbf{J}}_{k}$ , as summarized in Algorithm 3.

Similar to Corollary 1, we can compute the sample complexity in closed form for some joint distributions.

Corollary 2.

For the joint distribution $P_{XY}$ as constructed in Corollary 1, all non-zero singular values of the corresponding $\tilde{\mathbf{B}}$ are $\sigma_{1}=\sigma_{2}=\dots=\sigma_{d-1}$ as given by (33). Then, for all $k\in[d-1]$ , we have $\bar{\beta}_{k}(r)=\frac{\sigma_{1}^{2}}{4(1+r)}$ , and thus the corresponding error exponent is

[TABLE]

Proof.

See Appendix S. ∎

V-C The Optimal Number of Samples with the Cost Constraint

In semi-supervised learning, while the labeled samples are more useful than the unlabeled samples in learning problems, it is often much more expensive to acquire labeled samples than unlabeled ones. Therefore, it is important to understand the fundamental tradeoff between the sampling cost and the performance in learning tasks. In the following, we investigate this tradeoff for the sample complexity of learning the maximal correlation functions.

Suppose that the costs of acquiring the labeled and unlabeled samples are $\mathsf{C}_{\ell}$ and $\mathsf{C}_{u}$ per sample, respectively, and the total budget for sampling is $\mathsf{C}$ . Then, the number of labeled samples $n_{\ell}$ and the number of unlabeled samples $n_{u}$ we can get are constrained by $n_{\ell}\mathsf{C}_{\ell}+n_{u}\mathsf{C}_{u}\leq\mathsf{C}$ . Without loss of generality, we consider the case $\sigma_{k}>\sigma_{k+1}$ , and it follows from Theorem 4 that the error exponent for estimating the $k$ -dimensional maximal correlation functions with these samples is $\epsilon n_{\ell}/[2\bar{\alpha}_{k}(r)]$ , where $r=n_{u}/n_{\ell}$ . Hence, the optimal error exponent that can be achieved by the sampling budget $\mathsf{C}$ is given by

[TABLE]

which immediately implies the following proposition.

Proposition 4.

Given the sampling budget constraint $\mathsf{C}$ , the optimal number of labeled samples $n_{\ell}$ and unlabeled samples $n_{u}$ to optimize the sample complexity of estimating the $k$ -dimensional maximal correlation functions are

[TABLE]

where

[TABLE]

Note that the optimal ratio $r^{*}$ , which indicates the relative importance of the unlabeled samples compared to the labeled samples after taking the sampling costs into account, is independent of $\mathsf{C}$ . While the optimization problem (63) has no analytical solution and is neither convex nor concave, we can find a local optimum by the numerical differentiation approach [24]. In particular, a local optimum of $r$ can be computed via the updating rule

[TABLE]

where $h>0$ is the step size for computing the numerical differentiation, and $\eta>0$ is the learning rate for gradient descent.
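An update of this kind can be sketched as follows. With $n_{u}=rn_{\ell}$ and the budget spent fully, $n_{\ell}=\mathsf{C}/(\mathsf{C}_{\ell}+r\mathsf{C}_{u})$ , so maximizing $n_{\ell}/[2\bar{\alpha}_{k}(r)]$ amounts to minimizing $(\mathsf{C}_{\ell}+r\mathsf{C}_{u})\bar{\alpha}_{k}(r)$ over $r\geq 0$ . The sketch makes several assumptions: `alpha_bar` is a hypothetical non-increasing convex stand-in for $\bar{\alpha}_{k}(r)$ rather than the spectral norm of $\bar{\mathbf{G}}_{k}(r)$ , a central difference is used for the numerical derivative, and the costs, step size, and learning rate are illustrative values.

```python
import numpy as np

def alpha_bar(r, a0=1.0, a_inf=0.25):
    """Hypothetical stand-in for alpha-bar_k(r): non-increasing and convex in r."""
    return a_inf + (a0 - a_inf) / (1.0 + r)

def cost_weighted(r, c_l, c_u):
    """Objective to minimize: (C_l + r * C_u) * alpha-bar_k(r)."""
    return (c_l + r * c_u) * alpha_bar(r)

def optimal_ratio(c_l, c_u, r0=1.0, h=1e-4, eta=0.5, iters=2000):
    """Gradient descent on r using a central-difference derivative estimate."""
    r = r0
    for _ in range(iters):
        grad = (cost_weighted(r + h, c_l, c_u)
                - cost_weighted(r - h, c_l, c_u)) / (2.0 * h)
        r = max(r - eta * grad, 0.0)          # keep the ratio non-negative
    return r

c_l, c_u, budget = 10.0, 1.0, 1000.0          # illustrative costs and budget
r_star = optimal_ratio(c_l, c_u)
n_l = budget / (c_l + r_star * c_u)           # labeled samples under the budget
n_u = r_star * n_l                            # unlabeled samples under the budget
```

For this particular stand-in the minimizer has the closed form $r^{*}=\sqrt{27}-1$ , which provides a check on the iteration. With the actual $\bar{\alpha}_{k}(r)$ the objective need not be convex, so the result should be read as a local optimum that depends on the initial point.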

VI The Numerical Simulations

In this section, we validate our theoretical results with numerical simulations. In our experiments, we choose $|{\mathcal{X}}|=|{\mathcal{Y}}|=4$ , and the joint distribution $P_{XY}$ as

[TABLE]

In the following, we compare the empirical error exponents (10) and (40) for estimating $k=2$ dimensional maximal correlation functions with the theoretical results. Note that since the joint distribution $P_{XY}$ of (64) is a special case of Corollaries 1 and 2, we can apply the results from the corollaries as our theoretical benchmarks.

VI-A Supervised Learning

In this experiment, we sample the learning error $\left\|\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\right\|_{\mathrm{F}}^{2}-\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}$ as follows. For each sample of the learning error, we first generate $n=10^{6}$ pairs of $(x_{i},y_{i})$ , i.i.d. from $P_{XY}$ , and then compute $\hat{\mathbf{B}}$ from the empirical distribution $\hat{P}_{XY}$ of these $n$ pairs. Then, the singular vectors of $\hat{\mathbf{B}}$ are computed to get a sample of $\left\|\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\right\|_{\mathrm{F}}^{2}-\left\|\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}\right\|_{\mathrm{F}}^{2}$ . We repeat this sampling process $10^{5}$ times, and consider the empirical probability

[TABLE]

over the $10^{5}$ samples. Then, the empirical error exponent can be computed as

[TABLE]

The comparison between the empirical error exponent and the error exponent computed from Corollary 1 is plotted in Fig. 2, in which we can see the agreement between these two error exponents.
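The sampling procedure of this experiment can be sketched at a much smaller scale. The sketch relies on several assumptions: the joint distribution is taken to place $p_{1}$ on the pairs with $x=y$ and $p_{2}$ elsewhere, $\tilde{\mathbf{B}}$ and $\hat{\mathbf{B}}$ are taken to have entries $[P(x,y)-P_{X}(x)P_{Y}(y)]/\sqrt{P_{X}(x)P_{Y}(y)}$ under the true and empirical distributions, the empirical exponent is taken as $-\frac{1}{n}\log$ of the empirical probability, and the threshold is set to the sample median purely so that the empirical probability is non-degenerate. The function names, sample sizes, and seed are illustrative.

```python
import numpy as np

def b_tilde(P):
    """Assumed definition of B-tilde for a joint P indexed [x, y]; returns |Y| x |X|."""
    px, py = P.sum(axis=1), P.sum(axis=0)
    prod = np.outer(px, py)
    return ((P - prod) / np.sqrt(prod)).T

def learning_errors(P, k, n, trials, rng):
    """Sample ||B Phi_k||_F^2 - ||B Phi_hat_k||_F^2 over independent datasets."""
    B = b_tilde(P)
    s = np.linalg.svd(B, compute_uv=False)
    best = np.sum(s[:k] ** 2)                 # ||B Phi_k||_F^2 for the true Phi_k
    errs = np.empty(trials)
    for t in range(trials):
        counts = rng.multinomial(n, P.ravel()).reshape(P.shape)
        B_hat = b_tilde(counts / n)           # built from the empirical distribution
        Phi_hat = np.linalg.svd(B_hat)[2][:k].T
        errs[t] = best - np.linalg.norm(B @ Phi_hat, "fro") ** 2
    return errs

p2 = 0.05
P = np.full((4, 4), p2)
np.fill_diagonal(P, 0.25 - 3 * p2)            # assumed Corollary 1 form, sums to 1

rng = np.random.default_rng(0)
n = 2000
errs = learning_errors(P, k=2, n=n, trials=200, rng=rng)
eps = np.median(errs)                         # illustrative threshold only
p_emp = np.mean(errs > eps)                   # empirical probability of exceeding eps
exponent = -np.log(p_emp) / n                 # assumed form of the empirical exponent
```

The sample sizes here are far smaller than the $n=10^{6}$ and $10^{5}$ repetitions used in the experiment, so the resulting number is not comparable with Fig. 2; the sketch only illustrates the order of the steps.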

VI-B Semi-supervised Learning

In the experiment for semi-supervised learning, for each sample of the learning error, we take $r=1$ , and generate $n=10^{6}$ pairs of $(x_{i},y_{i})$ , i.i.d. from $P_{XY}$ , and $m=nr=10^{6}$ samples $x_{j}$ , i.i.d. from $P_{X}$ , and then compute $\bar{\mathbf{B}}$ according to the empirical distribution $\bar{P}_{XY}$ from (36). This sampling process is repeated $10^{5}$ times, and the empirical probability of the learning error exceeding $\epsilon$ is defined as

[TABLE]

Then, the empirical error exponent can be computed as

[TABLE]

The comparison between the empirical error exponent and the error exponent computed from Corollary 2 is plotted in Fig. 2, in which we can see the agreement between these two error exponents.

Appendix A Alternating Conditional Expectation Algorithm (Algorithm 1)

For convenience, we assume that the empirical distribution $\hat{P}_{XY}=P_{XY}$ and thus $\hat{\mathbf{B}}=\tilde{\mathbf{B}}$ . Let $\sigma_{i}$ denote the $i$ -th singular value of $\tilde{\mathbf{B}}$ , we will show that Algorithm 1 converges to the maximal correlation functions $f^{*}$ and $g^{*}$ which achieve the maximal correlation $\rho_{k}(X;Y)=\sum_{i=1}^{k}\sigma_{i}$ .

To begin, let ${\mathbb{\Phi}}_{k}$ and ${\mathbb{\Psi}}_{k}$ be the matrices composed of the top $k$ right singular vectors and left singular vectors of $\tilde{\mathbf{B}}$ , respectively, as defined in Section II. Then, from the analyses in Section II, we know that after the alternating conditional expectation processes (cf. lines 4–6 of Algorithm 1), $\hat{{\mathbb{\Phi}}}_{k}$ and ${\mathbb{\Phi}}_{k}$ have the same column space. Then, after the whitening of $\hat{f}$ in line 8, we have

[TABLE]

where $\mathbf{Q}\in\mathbb{R}^{k\times k}$ is an orthogonal matrix. Then, with lines 9–11, we obtain

[TABLE]

where (67) follows from the fact that $\tilde{\mathbf{B}}\hat{{\mathbb{\Phi}}}_{k}=\tilde{\mathbf{B}}{\mathbb{\Phi}}_{k}\mathbf{Q}={\mathbb{\Psi}}_{k}\bm{\Sigma}_{k}\mathbf{Q}$ , where $\bm{\Sigma}_{k}=\operatorname*{diag}\{\sigma_{1},\dots,\sigma_{k}\}$ . In addition, (68) follows from the fact that ${\mathbb{\Phi}}_{k}^{\mathrm{T}}{\mathbb{\Phi}}_{k}=\mathbf{I}_{k}$ , and (70) follows from the fact that $\mathbf{Q}\mathbf{Q}^{\mathrm{T}}=\mathbf{I}_{k}$ since $\mathbf{Q}$ is orthogonal, with $\mathbf{I}_{k}$ representing the identity matrix of order $k$ .

From (65) and (70), we know that

[TABLE]

Finally, we have

[TABLE]

As a consequence, $\hat{f}$ and $\hat{g}$ are the maximal correlation functions that achieve the HGR maximal correlation $\rho_{k}(X;Y)$ .
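The key step of the argument, namely that an orthogonal rotation $\mathbf{Q}$ of the top-$k$ singular vectors does not change the attained value $\sum_{i=1}^{k}\sigma_{i}$, can be checked numerically. The sketch below is a minimal illustration under assumptions: a random matrix stands in for $\tilde{\mathbf{B}}$, and the estimated left factor is taken to be ${\mathbb{\Psi}}_{k}\mathbf{Q}$, which is an assumption about what lines 9–11 of Algorithm 1 produce.

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.standard_normal((6, 5))            # stand-in for B~, shape (|Y|, |X|)
k = 2

u, s, vt = np.linalg.svd(b)
phi_k, psi_k = vt[:k].T, u[:, :k]          # top-k right / left singular vectors

# Random k x k orthogonal matrix Q from a QR factorization.
q, _ = np.linalg.qr(rng.standard_normal((k, k)))
phi_hat = phi_k @ q                        # same column space as phi_k
psi_hat = psi_k @ q                        # assumed output of lines 9-11

# tr(Psi_hat^T B Phi_hat) = tr(Q^T Sigma_k Q) = sigma_1 + ... + sigma_k
value = np.trace(psi_hat.T @ b @ phi_hat)

# Equivalently, the singular values of B Phi_hat are sigma_1, ..., sigma_k.
nuclear = np.linalg.svd(b @ phi_hat, compute_uv=False).sum()
```

Both `value` and `nuclear` equal `s[:k].sum()` up to rounding, regardless of which orthogonal `q` is drawn.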

Appendix B Proof of Eq. (9)

Firstly, note that

[TABLE]

where the second equality follows from the fact that

[TABLE]

since $\bm{\phi}_{1},\dots,\bm{\phi}_{d}$ form an orthonormal set. Therefore, we have

[TABLE]

where the second inequality follows from the assumption that $\langle\hat{\bm{\phi}}_{1},\bm{\phi}_{1}\rangle\geq 0$ . As a consequence, we obtain (9) as desired.

Appendix C Proof of Lemma 1

Suppose $\lambda_{1},\dots,\lambda_{k}$ take $q$ distinct values, and the indices $i_{0},\dots,i_{q}$ are defined such that $0=i_{0}<\dots<i_{q}=k$ and

[TABLE]

Therefore, we have

[TABLE]

where $\bm{v}_{j}(\tau)$ denotes the $j$ -th column of $\mathbf{V}_{k}$ . We first consider the summation for $s=1$ ,

[TABLE]

where $\mathbf{V}_{i_{1}}(\tau)\in\mathbb{R}^{d\times i_{1}}$ is composed of the first $i_{1}$ columns of $\mathbf{V}_{k}(\tau)$ . First, note that since $\mathbf{A}(\tau)$ is analytic, there exists a symmetric matrix $\mathbf{A}^{\prime\prime}$ such that

[TABLE]

In addition, the analyticity of $\mathbf{A}(\tau)$ implies that the eigenspace $\mathbf{V}_{k}(\tau)$ is also analytic [25], and thus has the expansion

[TABLE]

where $\hat{\mathbf{V}}_{i_{1}}$ , $\mathbf{V}_{i_{1}}^{\prime}$ , and $\mathbf{V}_{i_{1}}^{\prime\prime}$ are matrices in $\mathbb{R}^{d\times i_{1}}$ . Moreover, the columns of $\hat{\mathbf{V}}_{i_{1}}$ form an orthonormal basis of the eigenspace of $\mathbf{A}$ associated with $\lambda_{1}$ , and thus $\mathbf{A}\hat{\mathbf{V}}_{i_{1}}=\lambda_{1}\hat{\mathbf{V}}_{i_{1}}$ . Then, from $\mathbf{V}_{i_{1}}^{\mathrm{T}}(\tau)\mathbf{V}_{i_{1}}(\tau)=\mathbf{I}_{i_{1}}$ , we obtain

[TABLE]

which in turn implies

[TABLE]

where $\mathbf{I}_{i_{1}}$ and $\mathbf{O}_{i_{1}}$ are the identity matrix and the zero matrix in $\mathbb{R}^{i_{1}\times i_{1}}$ , respectively. Therefore, we have

[TABLE]

where the penultimate equality follows from the fact that $\mathbf{A}\hat{\mathbf{V}}_{i_{1}}=\lambda_{1}\hat{\mathbf{V}}_{i_{1}}$ , the last equality follows from (72), and $\mathbf{I}_{d}$ is the identity matrix in $\mathbb{R}^{d\times d}$ .

In addition, we define the matrix

[TABLE]

where $\lambda_{1}(\tau),\dots,\lambda_{i_{1}}(\tau)$ are the largest $i_{1}$ eigenvalues of $\mathbf{A}(\tau)$ . Then, it follows from the analyticity of $\mathbf{A}(\tau)$ that $\mathbf{\Lambda}_{i_{1}}(\tau)$ is analytic and can be written as

[TABLE]

where $\mathbf{\Lambda}_{i_{1}}^{\prime}$ and $\mathbf{\Lambda}_{i_{1}}^{\prime\prime}$ are both diagonal matrices. Now, from $\mathbf{A}(\tau)\mathbf{V}_{i_{1}}(\tau)=\mathbf{V}_{i_{1}}(\tau)\mathbf{\Lambda}_{i_{1}}(\tau)$ we obtain

[TABLE]

Comparing the $\tau$ -order terms for both sides, we have

[TABLE]

and thus

[TABLE]

Left multiplying (74) by $\hat{\mathbf{V}}_{i_{1}}^{\mathrm{T}}$ , we obtain

[TABLE]

where we have again exploited the fact that $\mathbf{A}\hat{\mathbf{V}}_{i_{1}}=\lambda_{1}\hat{\mathbf{V}}_{i_{1}}$ .

Now, we can rewrite $\left[\mathbf{V}_{i_{1}}^{\prime\mathrm{T}}(\lambda_{1}\mathbf{I}_{d}-\mathbf{A})\mathbf{V}_{i_{1}}^{\prime}\right]$ of (73) as

[TABLE]

where (76a) follows from (74), and (76c) follows from (75). Furthermore, it follows from the eigen-decomposition of $\mathbf{A}$ that

[TABLE]

where $\hat{\bm{v}}_{j}$ is the $j$ -th column of $\hat{\mathbf{V}}_{i_{1}}$ ( $1\leq j\leq i_{1}$ ). Similarly, we have

[TABLE]

Hence, we obtain

[TABLE]

and its Moore-Penrose inverse

[TABLE]

Therefore, we have

[TABLE]

and hence

[TABLE]

where to obtain (78b) we have used (74), and to obtain (78c) we have used the fact that $(\lambda_{1}\mathbf{I}_{d}-\mathbf{A})^{\dagger}\hat{\mathbf{V}}_{i_{1}}$ is a zero matrix [cf. (77)], since $\hat{\bm{v}}_{i}$ is orthogonal to $\bm{v}_{j}$ for all $i\leq i_{1}<j$ .
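The two facts about the Moore-Penrose inverse used here can be verified numerically: $(\lambda_{1}\mathbf{I}_{d}-\mathbf{A})^{\dagger}$ is a sum over the eigenvectors outside the $\lambda_{1}$-eigenspace, and it annihilates that eigenspace. The sketch below is illustrative only; the matrix `a` and its eigenvalues are hypothetical, chosen so that $\lambda_{1}$ has multiplicity $i_{1}=2$.

```python
import numpy as np

rng = np.random.default_rng(1)
v, _ = np.linalg.qr(rng.standard_normal((4, 4)))    # orthonormal eigenvectors v_1..v_4
lam = np.array([2.0, 2.0, 1.0, 0.5])                # lambda_1 = lambda_2 (i_1 = 2)
i1 = 2
a = v @ np.diag(lam) @ v.T
a = (a + a.T) / 2                                   # remove rounding asymmetry

m = lam[0] * np.eye(4) - a                          # lambda_1 I - A (singular)

# Numerical pseudo-inverse; an explicit cutoff keeps the two zero eigenvalues at zero.
pinv_numeric = np.linalg.pinv(m, rcond=1e-10, hermitian=True)

# Closed form: sum over j > i_1 of v_j v_j^T / (lambda_1 - lambda_j).
pinv_formula = sum(
    np.outer(v[:, j], v[:, j]) / (lam[0] - lam[j]) for j in range(i1, 4)
)

# Applied to the lambda_1-eigenspace, the pseudo-inverse gives the zero matrix.
annihilated = pinv_numeric @ v[:, :i1]
```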

Then, from (73), (76) and (78), we obtain

[TABLE]

which implies that

[TABLE]

where (80a) follows from (77), and $\mathbf{V}_{i_{1}}$ of (80c) is defined as $\mathbf{V}_{i_{1}}\triangleq[\bm{v}_{1},\dots,\bm{v}_{i_{1}}]\in\mathbb{R}^{d\times i_{1}}$ . To obtain (80c), we have used the fact that both the columns of $\mathbf{V}_{i_{1}}$ and $\hat{\mathbf{V}}_{i_{1}}$ form an orthonormal basis of the eigenspace of $\mathbf{A}$ associated with the eigenvalue $\lambda_{1}$ .

Moreover, similar to the above derivations, for any $s$ we have

[TABLE]

Then from (71) and (81), we have

[TABLE]

which finishes the proof of the lemma.

Appendix D Proof of Lemma 2

Suppose $\lambda_{1},\dots,\lambda_{k}$ take $q$ distinct values, then we define indices $i_{0},\dots,i_{q}$ such that $0=i_{0}<\dots<i_{q-1}<k<k+1\leq i_{q}$ and

[TABLE]

We first consider the case $q=1$ , which implies $k<i_{1}$ . From (79) we have

[TABLE]

which implies [cf. (80)]

[TABLE]

To obtain $\hat{\bm{v}}_{i}~{}(1\leq i\leq k)$ , note that

[TABLE]

where to obtain the third equality we have used the fact that $\mathbf{A}\hat{\mathbf{V}}_{k}=\lambda_{1}\hat{\mathbf{V}}_{k}$ , and to obtain the last equality we have used the fact that $\left(\hat{\mathbf{V}}_{k}^{\mathrm{T}}\mathbf{V}_{k}^{\prime}+\mathbf{V}_{k}^{\prime\mathrm{T}}\hat{\mathbf{V}}_{k}\right)$ is a zero matrix as a consequence of (72a).

Since the columns of $\hat{\mathbf{V}}_{k}$ are $k$ orthonormal vectors in the eigenspace of $\mathbf{A}$ associated with the eigenvalue $\lambda_{1}$ , we can write $\hat{\mathbf{V}}_{k}$ as $\hat{\mathbf{V}}_{k}=\mathbf{V}_{i_{1}}\mathbf{U}$ , where ${\mathbf{U}=[\bm{u}_{1},\dots,\bm{u}_{k}]\in\mathbb{R}^{i_{1}\times k}}$ satisfies $\mathbf{U}^{\mathrm{T}}\mathbf{U}=\mathbf{I}_{k}$ . Moreover, from the definition of eigenvectors, $\mathbf{V}_{k}(\tau)$ is the $d\times k$ matrix with orthonormal columns that maximizes

[TABLE]

Therefore, $\mathbf{U}$ is the optimal solution of

[TABLE]

which implies that $\bm{u}_{1},\dots,\bm{u}_{k}$ are the top $k$ eigenvectors of the matrix $\mathbf{V}_{i_{1}}^{\mathrm{T}}\mathbf{A}^{\prime}\mathbf{V}_{i_{1}}$ .

As a result, we have

[TABLE]

where $\hat{\bm{v}}_{i}=\mathbf{V}_{i_{1}}\bm{u}_{i}~{}(1\leq i\leq k)$ , and $\bm{u}_{1},\dots,\bm{u}_{k}$ are the top $k$ eigenvectors of the matrix $\mathbf{V}_{i_{1}}^{\mathrm{T}}\mathbf{A}^{\prime}\mathbf{V}_{i_{1}}$ .
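The selection rule for $\hat{\bm{v}}_{i}$ in the degenerate case can be illustrated numerically: when $\lambda_{1}$ is repeated, the top eigenvector of the perturbed matrix approaches $\mathbf{V}_{i_{1}}\bm{u}_{1}$, where $\bm{u}_{1}$ is the top eigenvector of $\mathbf{V}_{i_{1}}^{\mathrm{T}}\mathbf{A}^{\prime}\mathbf{V}_{i_{1}}$. The following is a minimal sketch, assuming $\mathbf{A}(\tau)=\mathbf{A}+\tau\mathbf{A}^{\prime}$ truncated at first order; the matrices `a` and `a_prime` are hypothetical examples.

```python
import numpy as np

a = np.diag([2.0, 2.0, 1.0, 0.5])              # lambda_1 = lambda_2, so i_1 = 2
a_prime = np.array([[1.0, 0.5, 0.3, 0.1],      # hypothetical symmetric perturbation A'
                    [0.5, -1.0, 0.2, 0.4],
                    [0.3, 0.2, 0.7, 0.0],
                    [0.1, 0.4, 0.0, 0.2]])
v_i1 = np.eye(4)[:, :2]                        # eigenvectors of A for lambda_1
k = 1

# u_1: top eigenvector of V^T A' V (eigh returns eigenvalues in ascending order).
_, u = np.linalg.eigh(v_i1.T @ a_prime @ v_i1)
u_top = u[:, ::-1][:, :k]
v_hat = v_i1 @ u_top                           # predicted limit of v_1(tau) as tau -> 0

# Top eigenvector of the perturbed matrix for a small tau.
tau = 1e-6
_, vec_tau = np.linalg.eigh(a + tau * a_prime)
v_tau = vec_tau[:, ::-1][:, :k]

# |<v_hat, v_tau>| is close to 1 when the prediction is correct (sign is arbitrary).
alignment = abs((v_hat.T @ v_tau).item())
```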

Similarly, for $q>1$ , we have

[TABLE]

where $l$ is the minimal element of $\mathcal{I}_{k}$ , and

[TABLE]

with $\bm{u}_{1},\dots,\bm{u}_{k-l+1}$ being the top $k-l+1$ eigenvectors of $\mathbf{V}_{\mathcal{I}_{k}}^{\mathrm{T}}\mathbf{A}^{\prime}\mathbf{V}_{\mathcal{I}_{k}}$ . Therefore, we obtain

[TABLE]

where (83b) follows from Lemma 1.

Appendix E Proof of Lemma 3

First, let us define $p_{\min}\triangleq\min\{P_{XY}(x,y)\colon(x,y)\in{\mathcal{X}}\times{\mathcal{Y}},P_{XY}(x,y)>0\}$ . Then for all $\hat{P}_{XY}\in\mathcal{N}(\epsilon)$ , we have, using Pinsker’s inequality [26],

[TABLE]

Therefore, for all $(x,y)\in{\mathcal{X}}\times{\mathcal{Y}}$ we obtain

[TABLE]

Thus, it follows from (22) that

[TABLE]

where the last inequality follows from (85) and the fact that $p_{\min}\leq 1$ . Hence, we have $\left\|\bm{\Xi}\right\|_{\mathrm{F}}\leq C$ , where $C\triangleq\frac{1+|{\mathcal{X}}|+|{\mathcal{Y}}|}{p^{3}_{\min}}\sqrt{\frac{2|{\mathcal{X}}||{\mathcal{Y}}|}{\alpha_{k}}}$ depends only on $P_{XY}$ .
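For reference, Pinsker's inequality in the form $\|P-Q\|_{1}\leq\sqrt{2D(P\|Q)}$ (with the divergence in nats) can be checked numerically. The sketch below only illustrates the inequality itself; the function names are hypothetical, and the precise way it enters the bound above is given by the displayed equations.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in nats, with the convention 0 log 0 = 0."""
    mask = p > 0
    return max(float(np.sum(p[mask] * np.log(p[mask] / q[mask]))), 0.0)

def pinsker_gap(p, q):
    """sqrt(2 D(p || q)) - ||p - q||_1, which Pinsker's inequality says is >= 0."""
    return np.sqrt(2.0 * kl_divergence(p, q)) - np.abs(p - q).sum()
```

For instance, `pinsker_gap(np.array([0.5, 0.5]), np.array([0.9, 0.1]))` is non-negative, as it is for any pair of distributions with `q` strictly positive on the support of `p`.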

To prove (23), we first define $\tau\triangleq\sqrt{\epsilon}$ for the convenience of presentation. From (21), we can represent the differences between the empirical marginal distributions and the true marginal distributions as

[TABLE]

where

[TABLE]

In addition, it follows from (87) and (88) that

[TABLE]

and

[TABLE]

Therefore, from (21) and (87)–(92) we have, for all $(x,y)\in{\mathcal{X}}\times{\mathcal{Y}}$ ,

[TABLE]

which is equivalent to (23).

Appendix F Proof of Lemma 4

For $\epsilon>0$ and $t>0$ , we define the subset $\mathcal{S}^{(t)}_{2}(\epsilon)$ of $\mathcal{N}(\epsilon)$ as

[TABLE]

where $\bm{\phi}_{i}$ denotes the $i$ -th right singular vector of $\tilde{\mathbf{B}}$ , and $\mathbf{\Xi}$ is as defined in (22). Then, it is convenient to first establish the following useful lemma.

Lemma 8.

For all $t\in(0,2)$ , we have

[TABLE]

Using Lemma 8, we establish Lemma 4 as follows. First, for all empirical distributions $\hat{P}_{XY}\in\mathcal{N}(\epsilon)$ , it follows from Lemma 3 that

[TABLE]

From the perturbation analysis result of Lemma 1, we can represent the learning error as

[TABLE]

Therefore, for any $t\in(0,1)$ , there exists an $\epsilon_{0}>0$ such that for all $\epsilon\in(0,\epsilon_{0})$ , we have

[TABLE]

This implies that

[TABLE]

From the first inequality of (99), we obtain

[TABLE]

As $t$ can be chosen to be arbitrarily close to [math], we must have

[TABLE]

It remains only to establish Lemma 8.

Proof of Lemma 8.

Since the set $\mathcal{S}_{2}^{(t)}(\epsilon)$ is closed, we have

[TABLE]

Then, for all $\hat{P}_{XY}\in\mathcal{S}_{2}^{(t)}(\epsilon)\subset\mathcal{N}(\epsilon)$ with $\hat{P}_{XY}\leftrightarrow\Gamma$ , it follows from (85) that $\Gamma(y,x)$ is bounded for all $(x,y)\in{\mathcal{X}}\times{\mathcal{Y}}$ . Hence, it follows from the second-order Taylor series expansion of the K-L divergence that

[TABLE]
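The quality of the second-order approximation of the K-L divergence can be checked numerically. The sketch below assumes that (21) parameterizes the empirical distribution as $\hat{P}_{XY}=P_{XY}+\tau\sqrt{P_{XY}}\,\Gamma$ with $\tau=\sqrt{\epsilon}$ and $\sum\sqrt{P_{XY}}\,\Gamma=0$, in which case the divergence is approximately $\frac{\tau^{2}}{2}\|\bm{\Gamma}\|_{\mathrm{F}}^{2}$; this parameterization and the function names are assumptions, and for simplicity `p` and `gamma` share the same index order.

```python
import numpy as np

def kl_divergence(p_hat, p):
    """D(p_hat || p) in nats for strictly positive arrays."""
    return float(np.sum(p_hat * np.log(p_hat / p)))

def kl_and_quadratic(p, gamma, tau):
    """Exact divergence and its second-order approximation tau^2/2 * ||gamma||_F^2."""
    p_hat = p + tau * np.sqrt(p) * gamma
    return kl_divergence(p_hat, p), 0.5 * tau**2 * float(np.sum(gamma**2))

rng = np.random.default_rng(0)
p = rng.random((3, 4)) + 0.5                        # keep entries away from zero
p /= p.sum()
gamma = rng.standard_normal(p.shape)
gamma -= np.sum(gamma * np.sqrt(p)) * np.sqrt(p)    # enforce sum sqrt(P) * Gamma = 0
gamma /= np.linalg.norm(gamma)                      # ||Gamma||_F = 1

kl, quad = kl_and_quadratic(p, gamma, tau=1e-4)
ratio = kl / quad                                   # tends to 1 as tau -> 0
```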

Moreover, since $\hat{P}_{XY}$ and $P_{XY}$ are probability distributions, it follows from (21) that

[TABLE]

Therefore, the characterization of (95) leads to the following optimization problem:

[TABLE]

As we will verify, although not imposed as a constraint, the condition $\hat{P}_{XY}\in\mathcal{N}(\epsilon)$ can be satisfied for the optimal $\bm{\Gamma}$ . Note that since both the objective function and the inequality constraint (104b) are quadratic, the optimal solution of (104) can be obtained via solving

[TABLE]

where we have interchanged the objective function and the quadratic function in the inequality constraint. Furthermore, we can show that (105) is equivalent to the optimization problem without the equality constraint, i.e.,

[TABLE]

To see this, suppose that $\bm{\Gamma}^{*}$ is the optimal solution of (106) with $c\triangleq\sum_{x\in{\mathcal{X}},y\in{\mathcal{Y}}}\sqrt{P_{XY}(x,y)}\Gamma^{*}(y,x)$ . Then, let $z(x,y)\triangleq\Gamma^{*}(y,x)-c\sqrt{P_{XY}(x,y)}$ , and we have

[TABLE]

which implies $|c|\leq 1$ .

If $|c|=1$ , we have $\Gamma^{*}(y,x)=\pm\sqrt{P_{XY}(x,y)}$ , and it follows from (22) that $\Xi(y,x)=\mp\sqrt{P_{X}(x)P_{Y}(y)}$ . Hence, we have $\mathbf{\Xi}=\mp\bm{\psi}_{0}\bm{\phi}_{0}^{\mathrm{T}}$ , where $\bm{\psi}_{0}$ is a $|{\mathcal{Y}}|$ -dimensional vector with its $y$ -th element being $\sqrt{P_{Y}(y)}$ , and $\bm{\phi}_{0}\in\mathbb{R}^{|{\mathcal{X}}|}$ with the $x$ -th element being $\sqrt{P_{X}(x)}$ . Then, the objective function is zero since $\tilde{\mathbf{B}}^{\mathrm{T}}\bm{\Xi}=\mp\tilde{\mathbf{B}}^{\mathrm{T}}\bm{\psi}_{0}\bm{\phi}_{0}^{\mathrm{T}}=\mathbf{O}$ , which contradicts the assumption that $\bm{\Gamma}^{*}$ is optimal. Moreover, if $0<|c|<1$ , then we can construct the matrix $\bm{\Gamma}^{\prime}$ with elements $\Gamma^{\prime}(y,x)=z(x,y)/\sqrt{1-c^{2}}$ . It can be verified that $\left\|\bm{\Gamma}^{\prime}\right\|_{\mathrm{F}}^{2}=1$ and the objective function in (105) for $\bm{\Gamma}^{\prime}$ is $1/(1-c^{2})$ times the corresponding value for $\bm{\Gamma}^{*}$ . This again contradicts the optimality of $\bm{\Gamma}^{*}$ . Therefore, we have $c=0$ , and the optimization problem (106) has the same solution as that of (105).

In addition, it can be shown that for $(x^{\prime},y^{\prime})\in{\mathcal{X}}\times{\mathcal{Y}}$ such that $P_{XY}(x^{\prime},y^{\prime})=0$ , we must have $\Gamma^{*}(y^{\prime},x^{\prime})=0$ , since otherwise we can set $\Gamma^{*}(y^{\prime},x^{\prime})=0$ and rescale $\bm{\Gamma}^{*}$ to $\left\|\bm{\Gamma}^{*}\right\|_{\mathrm{F}}^{2}=1$ , which increases the objective function of (106) due to (22). Therefore, the optimal solution $\bm{\Gamma}^{*}$ satisfies the definition (21).

To simplify the objective function (106a), we employ the vectorization operation $\operatorname{vec}(\cdot)$ that stacks all columns of a matrix into a vector. Specifically, for $\mathbf{W}=[w_{ij}]\in\mathbb{R}^{p\times q}$ , we use $\operatorname{vec}(\mathbf{W})$ to denote the $pq$ -dimensional column vector with the $[p(j-1)+i]$ -th entry being $w_{ij}$ . Then, we can rewrite (106a) as

[TABLE]

where to obtain (107b)–(107c) we have used the properties of trace that

[TABLE]

and to obtain (107d) we have used the fact that

[TABLE]

Moreover, it follows from (22) and (14) that $\operatorname{vec}(\bm{\Xi})=\mathbf{L}\operatorname{vec}\left(\bm{\Gamma}\right)$ . Thus, (107e) can be reduced to

[TABLE]
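The vectorization convention used in this derivation corresponds to column-major flattening. The sketch below illustrates the index rule $[p(j-1)+i]$ together with two standard facts, $\operatorname{tr}(\mathbf{A}^{\mathrm{T}}\mathbf{B})=\operatorname{vec}(\mathbf{A})^{\mathrm{T}}\operatorname{vec}(\mathbf{B})$ and $\|\operatorname{vec}(\mathbf{W})\|=\|\mathbf{W}\|_{\mathrm{F}}$; the helper `vec` and the example matrices are hypothetical, and the identities shown are not necessarily the exact ones used in (107).

```python
import numpy as np

def vec(w):
    """Stack the columns of w into a single vector (column-major order)."""
    return w.flatten(order="F")

p, q = 3, 4
w = np.arange(p * q, dtype=float).reshape(p, q)

# With 1-based indices, entry w_ij is the [p(j-1)+i]-th entry of vec(w).
i, j = 2, 3
entry_from_vec = vec(w)[p * (j - 1) + i - 1]        # shift to 0-based indexing
entry_direct = w[i - 1, j - 1]

# Trace inner product equals the Euclidean inner product of the vectorizations.
rng = np.random.default_rng(0)
a, b = rng.standard_normal((p, q)), rng.standard_normal((p, q))
trace_form = np.trace(a.T @ b)
vec_form = vec(a) @ vec(b)

# The Euclidean norm of vec(w) is the Frobenius norm of w.
norm_vec, norm_fro = np.linalg.norm(vec(w)), np.linalg.norm(w, "fro")
```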

Since $\left\|\operatorname{vec}\left(\bm{\Gamma}\right)\right\|=\|\bm{\Gamma}\|_{\mathrm{F}}$ , the constraint of (106) is equivalent to $\left\|\operatorname{vec}\left(\bm{\Gamma}\right)\right\|\leq 1$ . Therefore, the maximum of (108) is the largest singular value $\alpha_{k}$ of $\mathbf{G}_{k}$ , which is the optimal value of the objective functions in (106) and (105). This implies that the optimal solution of the original optimization problem (104) is $\sqrt{\frac{t}{\alpha_{k}}}\bm{\Gamma}^{*}$ , with the corresponding optimal value being $t\alpha_{k}^{-1}$ . Let $\hat{P}^{*}_{XY}\leftrightarrow\sqrt{\frac{t}{\alpha_{k}}}\Gamma^{*}$ denote the corresponding optimal empirical distribution. Then we have, for $\epsilon$ sufficiently small,

[TABLE]

where we have used the fact that $t\in(0,2)$ .

Hence, we obtain $\hat{P}^{*}_{XY}\in\mathcal{N}(\epsilon)$ and thus

[TABLE]

which implies (95).

∎

Appendix G Proof of Theorem 2

First, it follows from (27) that there exists an $\epsilon_{0}>0$ such that for all $\epsilon\in(0,\epsilon_{0})$ we have

[TABLE]

where we have defined $\kappa=(3\alpha_{k})^{-1}$ .

Then, using Sanov’s theorem, we have for all $\epsilon\in(0,\epsilon_{0})$ ,

[TABLE]

where to obtain (112) we have used (110), to obtain (113) we have used the fact that $n\geq 1$ , and to obtain (115) we have used the fact that $x\leq e^{x}-1<e^{x}$ .

Therefore, it suffices to choose $n$ such that

[TABLE]

which is equivalent to

[TABLE]

where we have used the fact that $\kappa=(3\alpha_{k})^{-1}$ .

Appendix H Proof of Proposition 1

First, note that (13) can be reduced to

[TABLE]

where

[TABLE]

where $\bm{\psi}_{i}\in\mathbb{R}^{|{\mathcal{Y}}|}$ is the $i$ -th left singular vector of $\tilde{\mathbf{B}}$ . Then, from the facts that $\sigma_{d-1}>0=\sigma_{d}$ and $d=|{\mathcal{X}}|\leq|{\mathcal{Y}}|$ , we know that $\bm{\phi}_{d}$ is the only right singular vector associated with the singular value $0$ , and thus we have

[TABLE]

Then it follows from (14) that the $[(x-1)|{\mathcal{Y}}|+y]$ -th entry of $\left(\mathbf{L}^{\mathrm{T}}\bm{\theta}_{id}\right)$ is

[TABLE]

where $\phi_{i}(x)$ and $\psi_{j}(y)$ denote the $x$ -th entry of $\bm{\phi}_{i}$ and the $y$ -th entry of $\bm{\psi}_{j}$ , respectively, and where to obtain (119a) we have exploited the fact that

[TABLE]

In addition, to obtain (119c), we have used the facts that

[TABLE]

and for $1\leq i\leq d-1$ ,

[TABLE]

where (121b) follows from the fact that the vector $\Bigl{[}\sqrt{P_{Y}(1)},\dots,\sqrt{P_{Y}(|{\mathcal{Y}}|)}\Bigr{]}^{\mathrm{T}}\in\mathbb{R}^{|{\mathcal{Y}}|}$ is a left singular vector of the matrix $\tilde{\mathbf{B}}$ associated with the singular value $0$ .

Hence, from (119) we have

[TABLE]

where $\mathbf{M}\in\mathbb{R}^{(|{\mathcal{X}}|\cdot|{\mathcal{Y}}|)\times|{\mathcal{X}}|}$ is a block diagonal matrix defined as

[TABLE]

As a result, it follows from (117) that

[TABLE]

from which we can obtain the eigen-decomposition of $\mathbf{G}_{k}$ . Indeed, since $\mathbf{M}^{\mathrm{T}}\mathbf{M}=\mathbf{I}_{d}$ , we have $\langle\mathbf{M}\bm{\phi}_{i},\mathbf{M}\bm{\phi}_{j}\rangle=\langle\bm{\phi}_{i},\bm{\phi}_{j}\rangle=\delta_{ij}$ . Therefore, from (123), the non-zero eigenvalues of $\mathbf{G}_{k}$ are $\sigma_{i}^{2}/4\,(i=1,\dots,d-1)$ , with the corresponding eigenvectors $\mathbf{M}\bm{\phi}_{i}\,(i=1,\dots,d-1)$ . As a result, the largest eigenvalue (i.e., the largest singular value) of $\mathbf{G}_{k}$ is

[TABLE]

where $\|\cdot\|_{\mathrm{s}}$ denotes the spectral norm of its argument.
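The eigen-decomposition argument relies only on $\mathbf{M}^{\mathrm{T}}\mathbf{M}=\mathbf{I}_{d}$, which makes the vectors $\mathbf{M}\bm{\phi}_{i}$ orthonormal. The sketch below illustrates this under assumptions: the block of $\mathbf{M}$ for each $x$ is taken to be the unit vector $\sqrt{P_{Y|X}(\cdot|x)}$, which is one block diagonal choice satisfying $\mathbf{M}^{\mathrm{T}}\mathbf{M}=\mathbf{I}_{d}$ and is not confirmed by (122); the values in `sigma` and the orthonormal basis `phi` are hypothetical.

```python
import numpy as np

def block_diag_m(p_xy):
    """Hypothetical M: column x holds sqrt(P_{Y|X}(.|x)) in rows x|Y| .. (x+1)|Y|-1."""
    nx, ny = p_xy.shape
    cond = p_xy / p_xy.sum(axis=1, keepdims=True)    # P_{Y|X}(y|x); each row sums to 1
    m = np.zeros((nx * ny, nx))
    for x in range(nx):
        m[x * ny:(x + 1) * ny, x] = np.sqrt(cond[x])
    return m

rng = np.random.default_rng(0)
p_xy = rng.random((3, 4)) + 0.1
p_xy /= p_xy.sum()
m = block_diag_m(p_xy)
d = p_xy.shape[0]

phi, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthonormal phi_1, ..., phi_d
sigma = np.array([0.6, 0.3])                          # hypothetical sigma_1, ..., sigma_{d-1}

# G = sum_i (sigma_i^2 / 4) (M phi_i)(M phi_i)^T, mirroring the structure of (123).
g = sum(
    (sigma[i] ** 2 / 4) * np.outer(m @ phi[:, i], m @ phi[:, i]) for i in range(d - 1)
)
eigvals = np.sort(np.linalg.eigvalsh(g))[::-1]        # descending order
```

The leading entries of `eigvals` are `sigma**2 / 4`, and the largest one, `sigma[0]**2 / 4`, is the spectral norm of `g`.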

Appendix I Proof of Theorem 3

The proof is similar to that of Theorem 1, except that we need to replace the perturbation analysis result of Lemma 1 with the corresponding result of Lemma 2. It can be verified that $\beta_{k}=\alpha_{k}$ if $\sigma_{k}>\sigma_{k+1}$ , so that the definition (124) is a generalization of (20). In particular, we extend the definition of $\mathcal{N}(\epsilon)$ to the case $\sigma_{k}=\sigma_{k+1}$ by letting

[TABLE]

Then, we define $\mathcal{S}^{(t)}_{3}(\epsilon)$ as the set of $\hat{P}_{XY}$ such that the corresponding $\Gamma$ from (21) satisfies

[TABLE]

where $\bm{\varphi}_{i}$ are as defined in (30). Then, analogous to Lemma 8, it is convenient to first establish the following result.

Lemma 9.

When $\sigma_{k}=\sigma_{k+1}$ , for all $t\in(0,2)$ , we have

[TABLE]

Proof.

The proof is similar to that of Lemma 8. Using the second-order Taylor series expansion of the K-L divergence (102), the limit (126) can be characterized by the following optimization problem:

[TABLE]

Following the same argument as that for Lemma 8, the optimal solution of (127) can be obtained by solving

[TABLE]

where we have interchanged the objective function and the quadratic function in the inequality constraint, and removed the equality constraint.

Then, similar to (107), we can rewrite the objective function (128a) as

[TABLE]

where the second equality follows from the fact that $\operatorname{vec}(\bm{\Xi})=\mathbf{L}\operatorname{vec}(\bm{\Gamma})$ . As a result, the optimization problem (128) can be rewritten as (31), and thus the optimal value is $\beta_{k}$ . Note that if $\sigma_{k}>\sigma_{k+1}$ , we may let $\bm{\varphi}_{i}=\bm{\phi}_{i}$ for $i=l,\dots,k$ since it does not change the value of (128a). Then, it can be verified that the optimal value of (128) is $\alpha_{k}$ , i.e., we have $\beta_{k}=\alpha_{k}$ if $\sigma_{k}>\sigma_{k+1}$ .

Finally, using the same argument as that for Lemma 8, we conclude that the optimal value of (127) is $t/\beta_{k}$ and we have

[TABLE]

which implies (126). ∎

In addition, it follows from Lemma 2 and Lemma 3 that the corresponding learning error for the empirical distribution $\hat{P}_{XY}\in\mathcal{N}(\epsilon)$ is

[TABLE]

Therefore, for any $t\in(0,1)$ , there exists an $\epsilon_{0}>0$ such that for all $\epsilon\in(0,\epsilon_{0})$ , we have

[TABLE]

Then, using arguments similar to (99)–(101), we conclude

[TABLE]

Finally, following the same proof as that for Theorem 1, we obtain (32).

Appendix J Proof of Corollary 1

We first introduce two useful lemmas.

Lemma 10.

Suppose $P_{XY}$ is as defined in Corollary 1 with the corresponding matrix $\tilde{\mathbf{B}}$ as given by (3). Then, the matrix $\tilde{\mathbf{B}}$ has singular values

[TABLE]

In addition, for all $\bm{\phi}=[\phi(1),\dots,\phi(d)]^{\mathrm{T}}\in\mathbb{R}^{d}$ with

[TABLE]

the corresponding $\bm{\psi}\triangleq\sigma_{1}^{-1}\tilde{\mathbf{B}}\bm{\phi}=\left[\psi(1),\dots,\psi(|{\mathcal{Y}}|)\right]^{\mathrm{T}}\in\mathbb{R}^{|{\mathcal{Y}}|}$ satisfies

[TABLE]

where $\mathbf{1}_{d}$ denotes the vector in $\mathbb{R}^{d}$ with all entries being $1$ .

Proof.

From the definition of $P_{XY}$ , we have

[TABLE]

and

[TABLE]

Therefore, from (3) we have

[TABLE]

where $\mathbf{O}_{|{\mathcal{Y}}|-d,d}$ is the zero matrix in $\mathbb{R}^{(|{\mathcal{Y}}|-d)\times d}$ . As a result, we have

[TABLE]

Since the matrix

[TABLE]

has eigenvalues $\lambda_{1}=\dots=\lambda_{d-1}=1$ and $\lambda_{d}=0$ , we obtain the singular values $\sigma_{1},\dots,\sigma_{d}$ of $\tilde{\mathbf{B}}$ as given by (130), and thus we can rewrite (133) as

[TABLE]

Hence, for all $\bm{\phi}\in\mathbb{R}^{d}$ satisfying (131), we have

[TABLE]

where $\bm{0}_{|{\mathcal{Y}}|-d}$ is the zero vector in $\mathbb{R}^{|{\mathcal{Y}}|-d}$ . ∎

Lemma 11.

For all $\bm{\phi}=[\phi(1),\dots,\phi(d)]^{\mathrm{T}}\in\mathbb{R}^{d}$ satisfying (131), we have

[TABLE]

where the inequality holds with equality if and only if $\bm{\phi}=\pm\bm{\phi}^{\prime}$ , where

[TABLE]

Proof.

From $\langle\bm{\phi},\mathbf{1}_{d}\rangle=0$ we have

[TABLE]

Therefore, we obtain

[TABLE]

where the inequality follows from the fact that the arithmetic mean is no greater than the root mean square. As a result, we have

[TABLE]

where the inequality holds with equality if and only if

[TABLE]

Hence, it follows from (131) and (136) that $\bm{\phi}=\pm\bm{\phi}^{\prime}$ . ∎

Now, Corollary 1 can be proved as follows.

Proof of Corollary 1.

From Lemma 10, we have $\sigma_{1}=\dots=\sigma_{d-1}>\sigma_{d}=0$ . Therefore, for all $1\leq k\leq d-1$ we have $\mathcal{I}_{k}=[d-1]$ , which further implies that $l=\min\mathcal{I}_{k}=1$ and $\mathcal{I}^{\mathsf{c}}_{k}=\{d\}$ . Hence, from (29) we have

[TABLE]

In addition, following the same derivation as that in (119), we have

[TABLE]

and thus

[TABLE]

Note that since $\left\langle\mathbf{M}\bm{\varphi}_{i},\mathbf{M}\bm{\varphi}_{j}\right\rangle=\delta_{ij}$ , (139) demonstrates the eigen-decomposition of $\mathbf{G}_{k}$ . Therefore, from Theorem 3, we have

[TABLE]

To prove the inequality holds with equality, it suffices to show that there exists a $\bm{\Gamma}$ with $\|\bm{\Gamma}\|_{\mathrm{F}}\leq 1$ such that

[TABLE]

Indeed, as we now illustrate, if $\bm{\Gamma}$ is chosen as

[TABLE]

with $\bm{\phi}^{\prime}$ as defined in (135), then we have $\bm{\varphi}_{1}=\pm\bm{\phi}^{\prime}$ and $\operatorname{vec}\left(\bm{\Gamma}\right)=\mathbf{M}\bm{\phi}^{\prime}=\pm\mathbf{M}\bm{\varphi}_{1}$ , and thus (140) holds.

To see this, first note that from (141) we have $\|\bm{\Gamma}\|_{\mathrm{F}}=1$ ,

[TABLE]

and

[TABLE]

where $\bm{\psi}^{\prime}=\left[\psi^{\prime}(1),\dots,\psi^{\prime}(|{\mathcal{Y}}|)\right]^{\mathrm{T}}\triangleq\sigma_{1}^{-1}\tilde{\mathbf{B}}\bm{\phi}^{\prime}$ .

Therefore, from (22) we obtain

[TABLE]

In addition, since $\mathcal{I}_{k}=[d-1]$ , from (30), $\bm{\varphi}_{1}$ is the solution of the optimization problem

[TABLE]

where $\bm{\phi}_{d}$ is the $d$ -th right singular vector of $\tilde{\mathbf{B}}$ . Since $\sigma_{d-1}>0=\sigma_{d}$ and $d=|{\mathcal{X}}|\leq|{\mathcal{Y}}|$ , we know that

[TABLE]

and thus $\langle\bm{\phi},\bm{\phi}_{d}\rangle=0$ is equivalent to $\langle\bm{\phi},\mathbf{1}_{d}\rangle=0$ .

Now, for all $\bm{\phi}$ satisfying the constraints of (143), the objective function of (143) is

[TABLE]

where $\bm{\psi}\triangleq\sigma_{1}^{-1}\tilde{\mathbf{B}}\bm{\phi}$ , and where to obtain (144c) we have used the fact that $\langle\bm{\phi},\bm{\phi}_{d}\rangle=0$ , to obtain (144d) we have used the facts that $\tilde{\mathbf{B}}\bm{\phi}=\sigma_{1}\bm{\psi}$ and $\tilde{\mathbf{B}}^{\mathrm{T}}\bm{\psi}=\sigma_{1}\bm{\phi}$ , and to obtain (144e) we have used Lemma 10 and the facts that

[TABLE]

Furthermore, to maximize (144f), note that

[TABLE]

As a result, it follows from Lemma 11 that (144f) is maximized when $\bm{\phi}=\pm\bm{\phi}^{\prime}$ , i.e., we have $\bm{\varphi}_{1}=\pm\bm{\phi}^{\prime}$ , which finishes the proof. ∎

Appendix K The Generalized ACE Algorithm (38)

First, we define the $|{\mathcal{X}}|$ and $|{\mathcal{Y}}|$ dimensional vectors $\bar{\bm{\phi}}_{i}$ and $\bar{\bm{\psi}}_{i}$ , respectively, for $i=1,\ldots,k$ as

[TABLE]

where $\bar{P}_{X}$ and $\bar{P}_{Y}$ are the marginal distributions of $\bar{P}_{XY}$ , and $\bar{f}_{i}$ and $\bar{g}_{i}$ are the $i$ -th dimension of $\bar{f}$ and $\bar{g}$ , i.e.,

[TABLE]

for all $x$ and $y$ . Then, the iterative steps of the generalized ACE algorithm (38) can be equivalently expressed as

[TABLE]

where

[TABLE]

Note that (145) coincides with the alternating least squares algorithm [19] for solving the low-rank approximation problem

[TABLE]

Then, using the same argument as that in Appendix A, we know that the generalized ACE algorithm (38) essentially computes the singular vectors of $\bar{\mathbf{B}}$ with respect to the top $k$ singular values.
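The connection to alternating least squares can be illustrated on a generic matrix. The sketch below is not the generalized ACE algorithm (38) itself: it is a plain alternating least squares iteration for $\min\|\mathbf{B}-\bm{\Psi}\bm{\Phi}^{\mathrm{T}}\|_{\mathrm{F}}^{2}$, applied to a hypothetical matrix `b` with prescribed singular values standing in for $\bar{\mathbf{B}}$, and the function name `als_low_rank` is an assumption.

```python
import numpy as np

def als_low_rank(b, k, iters=200, seed=0):
    """Alternating least squares for min ||b - psi @ phi.T||_F^2 with rank-k factors."""
    rng = np.random.default_rng(seed)
    phi = rng.standard_normal((b.shape[1], k))
    for _ in range(iters):
        # Fix phi, solve for psi: psi = b phi (phi^T phi)^{-1}.
        psi = np.linalg.lstsq(phi, b.T, rcond=None)[0].T
        # Fix psi, solve for phi: phi = b^T psi (psi^T psi)^{-1}.
        phi = np.linalg.lstsq(psi, b, rcond=None)[0].T
    return psi, phi

# Hypothetical test matrix with known singular values 1.0, 0.8, 0.3, 0.1.
rng = np.random.default_rng(1)
u, _ = np.linalg.qr(rng.standard_normal((6, 4)))
v, _ = np.linalg.qr(rng.standard_normal((5, 4)))
s = np.array([1.0, 0.8, 0.3, 0.1])
b = u @ np.diag(s) @ v.T

k = 2
psi, phi = als_low_rank(b, k)
residual = np.linalg.norm(b - psi @ phi.T) ** 2     # approaches s[2]**2 + s[3]**2

# Projector onto the column space of phi versus the top-k right singular space.
q_phi, _ = np.linalg.qr(phi)
proj_als = q_phi @ q_phi.T
proj_svd = v[:, :k] @ v[:, :k].T
```

After convergence, the column space of `phi` coincides with the span of the top $k$ right singular vectors, which is the sense in which the iteration computes singular vectors only up to an invertible $k\times k$ transformation.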

Appendix L Proof of Lemma 5

For any $\bar{P}_{XY}\in\bar{\mathcal{N}}(\epsilon)$ with the corresponding empirical distributions $\hat{P}_{XY}$ and $Q_{X}$ , it follows from (49) that

[TABLE]

Then, following the same argument as that for (84), we obtain

[TABLE]

In addition, from (52) we have

[TABLE]

where to obtain the second inequality we have used the fact that $\frac{r}{1+r}\leq\frac{\sqrt{r}}{2}\leq\sqrt{r}$ .

Then, it follows from (86c) that

[TABLE]

Hence, we have $\|\bar{\bm{\Xi}}\|_{\mathrm{F}}\leq\bar{C}$ with $\bar{C}\triangleq\frac{(1+|{\mathcal{X}}|+|{\mathcal{Y}}|)^{2}}{p^{3}_{\min}}\sqrt{\frac{2|{\mathcal{X}}||{\mathcal{Y}}|}{\bar{\alpha}_{k}(r)}}$ .

Turning now to the second part of the lemma, for convenience of presentation, in the following we write $\tau$ in place of $\sqrt{\epsilon}$ .

From (35), (50) and (87), we conclude

[TABLE]

where $\Gamma_{X}$ and $\zeta$ are as defined in (87) and (50). In addition, it follows from (21) and (87) that

[TABLE]

where $\Gamma$ is as defined in (21).

Therefore, we have

[TABLE]

Finally, it follows from (87)–(93) that

[TABLE]

Appendix M Proof of Lemma 6

First, note that

[TABLE]

where $T(\hat{P}_{XY})$ and $T(Q_{X})$ denote the type class of $\hat{P}_{XY}$ and the type class of $Q_{X}$ , respectively, and the last equality follows from the fact that $Q_{X}$ is independent of $\hat{P}_{XY}$ . Then, the probabilities of the two type classes are [26]

[TABLE]

and

[TABLE]

Moreover, for both type classes, the numbers of types are at most polynomial in $n$ . Therefore, via the Laplace principle [27] it follows that

[TABLE]

Appendix N Proof of Lemma 7

For $\epsilon>0$ and $t>0$ , we define the subset $\bar{\mathcal{S}}^{(t)}_{2}(\epsilon)$ of $\bar{\mathcal{N}}(\epsilon)$ as

[TABLE]

where $\bar{\bm{\Xi}}$ is as defined in (51). Then, it is convenient to first establish the following useful lemma.

Lemma 12.

For all $t\in(0,2)$ , we have

[TABLE]

Using Lemma 12, we establish Lemma 7 as follows. First, for all $\bar{P}_{XY}\in\bar{\mathcal{N}}(\epsilon)$ , it follows from Lemma 5 that

[TABLE]

From the perturbation analysis result of Lemma 1, we can represent the learning error as

[TABLE]

Therefore, for any $t\in(0,1)$ , there exists an $\epsilon_{0}>0$ such that for all $\epsilon\in(0,\epsilon_{0})$ , we have

[TABLE]

Then, using arguments similar to (99)–(101), from Lemma 12 we obtain

[TABLE]

It remains only to establish Lemma 12.

Proof of Lemma 12.

Since the set $\bar{\mathcal{S}}_{2}^{(t)}(\epsilon)$ is closed, we have

[TABLE]

Then, for all $\bar{P}_{XY}\in\bar{\mathcal{S}}_{2}^{(t)}(\epsilon)$ with the corresponding empirical distributions $\hat{P}_{XY}\leftrightarrow\Gamma$ and $Q_{X}\leftrightarrow\zeta$ for labeled and unlabeled data, it follows from the second-order Taylor series expansion of the K-L divergence that

[TABLE]

Therefore, the characterization of the error exponent (40) can be reduced to the following optimization problem:

[TABLE]

where the equality constraints follow from the definitions of $\Gamma$ and $\zeta$ . As we will verify, although not imposed as a constraint, the condition $\bar{P}_{XY}\in\bar{\mathcal{N}}(\epsilon)$ can be satisfied for the optimal $(\mathbf{\Gamma},\bm{\zeta})$ . Since both the objective function and the inequality constraint of (155) are quadratic, the optimal solution can be obtained via solving

[TABLE]

where we have again interchanged the objective function and the quadratic function in the inequality constraint. Then, with arguments similar to those of the supervised case, we can verify the optimal solution of (156) also satisfies (21) and (52). Furthermore, it can be verified that (156) is equivalent to the optimization problem without the equality constraints, i.e.,

[TABLE]

To see this, suppose $(\mathbf{\Gamma}^{*},\bm{\zeta}^{*})$ is the optimal solution of (157), and define $c_{1}\triangleq\sum_{x\in{\mathcal{X}},y\in{\mathcal{Y}}}\Gamma^{*}(y,x)\sqrt{P_{XY}(x,y)}$ and $c_{2}\triangleq\sum_{x\in{\mathcal{X}}}\zeta^{*}(x)$ . With $z_{1}(x,y)\triangleq\Gamma^{*}(y,x)-c_{1}\sqrt{P_{XY}(x,y)}$ and $z_{2}(x)\triangleq\zeta^{*}(x)-c_{2}\sqrt{P_{X}(x)}$ , we have

[TABLE]

which implies $c_{1}^{2}+rc_{2}^{2}\leq 1$ .

If $c_{1}^{2}+rc_{2}^{2}=1$ , then we have $z_{1}(x,y)\equiv 0$ and $z_{2}(x)\equiv 0$ , which implies $\Gamma^{*}(y,x)=c_{1}\sqrt{P_{XY}(x,y)}$ and $\zeta^{*}(x)=c_{2}\sqrt{P_{X}(x)}$ . Therefore, it follows from (51) that

[TABLE]

which implies that $\tilde{\mathbf{B}}^{\mathrm{T}}\bar{\bm{\Xi}}$ is a zero matrix. As a result, the objective function of (156) is zero, which contradicts the optimality of $(\mathbf{\Gamma}^{*},\bm{\zeta}^{*})$ . Moreover, if $c_{1}^{2}+rc_{2}^{2}<1$ , we can construct a feasible solution $(\mathbf{\Gamma}^{\prime},\bm{\zeta}^{\prime})$ with

[TABLE]

and it is straightforward to verify that the objective function for $(\mathbf{\Gamma}^{\prime},\bm{\zeta}^{\prime})$ is $\left(1-c_{1}^{2}-rc_{2}^{2}\right)^{-1}$ times the value for $(\mathbf{\Gamma}^{*},\bm{\zeta}^{*})$ . This again contradicts the optimality of $(\mathbf{\Gamma}^{*},\bm{\zeta}^{*})$ . Therefore, we have $c_{1}=c_{2}=0$ , and the optimization problem (157) has the same solution as (156).

To simplify the optimization problem (157), we define the vector $\bm{\varsigma}\in\mathbb{R}^{|{\mathcal{X}}|(|{\mathcal{Y}}|+1)}$ as

[TABLE]

and let $\mathbf{\Upsilon}$ be the $|{\mathcal{Y}}|\times|{\mathcal{X}}|$ matrix with the entries $\Upsilon(y,x)$ . Then, it follows from (42) and (52) that $\operatorname{vec}(\mathbf{\Upsilon})=\bar{\mathbf{L}}(r)\bm{\varsigma}$ .

Therefore, the objective function of (157) can be rewritten as

[TABLE]

where to obtain (160a) we have used (107), and to obtain (160c) we have used (13). In addition, since $\|\bm{\varsigma}\|^{2}=\|\mathbf{\Gamma}\|_{\mathrm{F}}^{2}+r\|\bm{\zeta}\|^{2}$ , the constraint of (157) can be rewritten as $\|\bm{\varsigma}\|\leq 1$ .

As a result, the maximum of (160e) is the spectral norm of $\bar{\mathbf{G}}_{k}(r)$ , i.e., $\bar{\alpha}_{k}(r)$ , which is also the optimal value of the objective functions in (157) and (156). This implies that the optimal solution of the original optimization problem (155) is $\left(\sqrt{\frac{t}{\bar{\alpha}_{k}(r)}}\mathbf{\Gamma}^{*},\sqrt{\frac{t}{\bar{\alpha}_{k}(r)}}\bm{\zeta}^{*}\right)$ , with the corresponding optimal value being $t/\bar{\alpha}_{k}(r)$ . Let $\hat{P}^{*}_{XY}\leftrightarrow\sqrt{\frac{t}{\bar{\alpha}_{k}(r)}}\Gamma^{*}$ and $Q^{*}_{X}\leftrightarrow\sqrt{\frac{t}{\bar{\alpha}_{k}(r)}}\zeta^{*}$ denote the corresponding empirical distributions. Then we have, for $\epsilon$ sufficiently small,

[TABLE]

where to obtain the inequality we have used the fact that $t\in(0,2)$ .

Hence, the corresponding optimal distribution $\bar{P}^{*}_{XY}$ as defined in (36) satisfies $\bar{P}^{*}_{XY}\in\bar{\mathcal{N}}(\epsilon)$ . Therefore, we conclude

[TABLE]

which implies (152).

∎

Appendix O Proof of Theorem 5

First, it follows from (57) that there exists an $\bar{\epsilon}_{0}>0$ that depends only on $P_{XY}$ and $r$ such that for all $\epsilon\in(0,\bar{\epsilon}_{0})$ we have

[TABLE]

where we have defined $\bar{\kappa}\triangleq[{3}{\bar{\alpha}_{k}(r)}]^{-1}$ .

Then, for all $\epsilon\in(0,\bar{\epsilon}_{0})$ , it follows from (150) that

[TABLE]

where $T(\hat{P}_{XY})$ and $T(Q_{X})$ denote the type classes of $\hat{P}_{XY}$ and $Q_{X}$, respectively. Here, (163) follows from the upper bound on the probability of type classes [26, Theorem 11.1.4], and (165) follows from the upper bound on the number of types [26, Theorem 11.1.1]. In addition, (166) follows from (161), (167) follows from $n\geq 1$, (170) follows from the fact that $x\leq e^{x}-1<e^{x}$, and (172) follows from the fact that $1+|{\mathcal{Y}}|\leq 3|{\mathcal{Y}}|/2$ since $|{\mathcal{Y}}|\geq 2$.
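The two method-of-types bounds invoked here can be checked on small cases. The sketch below assumes their standard forms: the number of types with denominator $n$ on an alphabet of size $m$ is at most $(n+1)^{m}$, and the probability of a type class $T(Q)$ under i.i.d. sampling from $P$ is at most $\exp(-nD(Q\|P))$, with the divergence in nats. The function names and the binary example are illustrative choices, not taken from the paper.

```python
from math import comb, exp, log

def number_of_types(n, alphabet_size):
    """Exact number of types with denominator n (stars-and-bars count)."""
    return comb(n + alphabet_size - 1, alphabet_size - 1)

def kl_divergence(q, p):
    """K-L divergence D(q || p) in nats, with the convention 0 log 0 = 0."""
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def binary_type_class_probability(n, k, p1):
    """Probability, under i.i.d. Bernoulli(p1), of sequences with exactly k ones."""
    return comb(n, k) * p1**k * (1 - p1) ** (n - k)

# Bound on the number of types: exact count <= (n + 1)^{alphabet size}.
types_ok = all(
    number_of_types(m, size) <= (m + 1) ** size
    for m in range(1, 51)
    for size in range(2, 7)
)

# Bound on the probability of each type class for a binary alphabet.
n, p1 = 20, 0.3
probability_ok = all(
    binary_type_class_probability(n, k, p1)
    <= exp(-n * kl_divergence((k / n, 1 - k / n), (p1, 1 - p1))) * (1 + 1e-9)
    for k in range(n + 1)
)

print(types_ok, probability_ok)
```

The small relative tolerance in the second check is there because the bound holds with equality for the all-zeros and all-ones types, where floating-point rounding could otherwise flip the comparison.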

Therefore, it suffices to choose $n$ such that

[TABLE]

which is equivalent to

[TABLE]

where we have used the fact that $\bar{\kappa}=[3\bar{\alpha}_{k}(r)]^{-1}$ .

Appendix P Proof of Proposition 2

First, we write the matrix $\bar{\mathbf{L}}(r)$ as defined in (42) as $\bar{\mathbf{L}}(r)=\left[\bar{\mathbf{L}}_{1}(r),\bar{\mathbf{L}}_{2}(r)\right]$, where $\bar{\mathbf{L}}_{1}(r)$ is composed of the first $(|{\mathcal{X}}|\cdot|{\mathcal{Y}}|)$ columns of $\bar{\mathbf{L}}(r)$, and $\bar{\mathbf{L}}_{2}(r)$ is composed of the remaining $|{\mathcal{X}}|$ columns of $\bar{\mathbf{L}}(r)$. Then it follows from the definition of $\bar{\mathbf{L}}(r)$ that

[TABLE]

where $\mathbf{I}_{|{\mathcal{X}}|\cdot|{\mathcal{Y}}|}$ is the identity matrix in $\mathbb{R}^{(|{\mathcal{X}}|\cdot|{\mathcal{Y}}|)\times(|{\mathcal{X}}|\cdot|{\mathcal{Y}}|)}$ , and $\mathbf{M}$ is as defined in (122).

Therefore, we have

[TABLE]

where to obtain (174c) we have exploited the fact that $\mathbf{M}^{\mathrm{T}}\mathbf{M}$ is the identity matrix in $\mathbb{R}^{|{\mathcal{X}}|\times|{\mathcal{X}}|}$.

Then, with $\|\cdot\|_{\mathrm{s}}$ denoting the spectral norm, we have

[TABLE]

where $\mathbf{G}_{k}^{\frac{1}{2}}$ is defined as the positive semidefinite matrix $\mathbf{C}$ such that $\mathbf{C}^{2}=\mathbf{G}_{k}$ , and where (175b) follows from the fact that for all matrices $\mathbf{A}$ , we have

[TABLE]
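As a concrete illustration of the matrix square root used above, the positive semidefinite $\mathbf{C}$ with $\mathbf{C}^{2}=\mathbf{G}$ can be computed from an eigen-decomposition. This is a minimal sketch: the helper `psd_sqrt` and the random test matrix are hypothetical and stand in for $\mathbf{G}_{k}$, whose entries are not reproduced here.

```python
import numpy as np

def psd_sqrt(G):
    """Return the symmetric PSD matrix C with C @ C == G, for symmetric PSD G."""
    eigvals, eigvecs = np.linalg.eigh(G)
    # Clip tiny negative eigenvalues caused by floating-point rounding.
    eigvals = np.clip(eigvals, 0.0, None)
    # Scale each eigenvector column by sqrt(eigenvalue), then recombine.
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
G = B @ B.T  # hypothetical PSD matrix standing in for G_k

C = psd_sqrt(G)
print(np.allclose(C @ C, G), np.allclose(C, C.T))
```

Because `C` shares its eigenvectors with `G` and has the square roots of its eigenvalues, the spectral norm of `C` squared equals the spectral norm of `G`.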

Moreover, from $\mathbf{M}^{\mathrm{T}}\mathbf{M}=\mathbf{I}_{|{\mathcal{X}}|}$ we have

[TABLE]

where we have defined

[TABLE]

Then, it follows from (175c) and (176)–(177) that

[TABLE]

Furthermore, for all $r_{2}>r_{1}\geq 0$ , we define $\hat{\mathbf{P}}$ as

[TABLE]

then it can be verified that $\hat{\mathbf{P}}$ satisfies $\bigl\|\hat{\mathbf{P}}\bigr\|_{\mathrm{s}}=1$ and $\mathbf{P}(r_{2})=\mathbf{P}(r_{1})\hat{\mathbf{P}}=\hat{\mathbf{P}}\mathbf{P}(r_{1})$. Hence, we have

[TABLE]

where the inequality follows from the submultiplicativity of the spectral norm [28].
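The monotonicity step relies only on submultiplicativity, $\|\mathbf{A}\mathbf{B}\|_{\mathrm{s}}\leq\|\mathbf{A}\|_{\mathrm{s}}\|\mathbf{B}\|_{\mathrm{s}}$, together with a right factor of unit spectral norm. A quick numerical sketch follows, with random matrices `A` and `B_unit` as hypothetical stand-ins for the factors in the proof:

```python
import numpy as np

rng = np.random.default_rng(2)

def spectral_norm(A):
    """Largest singular value of A."""
    return np.linalg.norm(A, 2)

A = rng.standard_normal((5, 7))
B = rng.standard_normal((7, 7))
B_unit = B / spectral_norm(B)  # rescaled so that its spectral norm is 1

lhs = spectral_norm(A @ B_unit)
rhs = spectral_norm(A) * spectral_norm(B_unit)

# Multiplying by a unit-norm factor cannot increase the spectral norm.
print(lhs <= rhs, lhs <= spectral_norm(A))
```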

To prove the convexity of $\bar{\alpha}_{k}(r)$ , we first define the function $w(r)=\frac{r}{1+r}$ for $r\geq 0$ . Since $w(r)$ is an increasing and concave function of $r$ , we have, for all $r_{1},r_{2}>0$ and $\theta\in(0,1)$ ,

[TABLE]

which implies that

[TABLE]

Therefore, we have

[TABLE]

where the first inequality follows from the fact that $\bar{\alpha}_{k}(r)$ is non-increasing, and the second inequality follows from the triangle inequality for the spectral norm.
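The two properties of $w(r)=\frac{r}{1+r}$ used in the convexity argument follow from $w^{\prime}(r)=(1+r)^{-2}>0$ and $w^{\prime\prime}(r)=-2(1+r)^{-3}<0$ for $r\geq 0$. The sketch below checks them on a grid; the grid range and step are arbitrary illustrative choices.

```python
def w(r):
    """The function w(r) = r / (1 + r) on r >= 0."""
    return r / (1.0 + r)

grid = [0.1 * i for i in range(0, 101)]    # r in [0, 10]
thetas = [0.1 * j for j in range(1, 10)]   # theta in (0, 1)

# Monotonicity: w is strictly increasing along the grid.
increasing = all(w(a) < w(b) for a, b in zip(grid, grid[1:]))

# Concavity: w(theta*r1 + (1-theta)*r2) >= theta*w(r1) + (1-theta)*w(r2).
concave = all(
    w(t * r1 + (1 - t) * r2) >= t * w(r1) + (1 - t) * w(r2) - 1e-12
    for r1 in grid
    for r2 in grid
    for t in thetas
)

print(increasing, concave)
```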

Finally, to obtain the lower bound of (46), note that

[TABLE]

where (178b) follows from the triangle inequality, (178c) follows from (176), (178d) follows from the submultiplicativity of the spectral norm, and the penultimate equality follows from the fact that $\|\mathbf{M}\|_{\mathrm{s}}=\sqrt{\left\|\mathbf{M}^{\mathrm{T}}\mathbf{M}\right\|_{\mathrm{s}}}=1$ , since $\mathbf{M}^{\mathrm{T}}\mathbf{M}$ is an identity matrix.

To obtain the upper bound of (46), note that

[TABLE]

where we have again used the triangle inequality.

Appendix Q Proof of Proposition 3

First, note that from (41) we have

[TABLE]

where the second equality follows from (123), and in the last equality we have defined

[TABLE]

In addition, note that $\hat{\mathbf{M}}(r)$ satisfies

[TABLE]

where to obtain the second equality we have used (174c). Therefore, we have $\left\langle\hat{\mathbf{M}}(r)\bm{\phi}_{i},\hat{\mathbf{M}}(r)\bm{\phi}_{j}\right\rangle=\langle\bm{\phi}_{i},\bm{\phi}_{j}\rangle=\delta_{ij}$ , and it follows from (179) that the non-zero eigenvalues of $\bar{\mathbf{G}}_{k}(r)$ are

[TABLE]

Hence, the largest eigenvalue (i.e., the largest singular value) of $\bar{\mathbf{G}}_{k}(r)$ is

[TABLE]
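The eigenvalue argument can be illustrated with a small numerical sketch. It assumes a decomposition of the form $\mathbf{Q}\mathbf{G}\mathbf{Q}^{\mathrm{T}}$, where $\mathbf{Q}$ has orthonormal columns and plays the role of $\hat{\mathbf{M}}(r)$. The exact form of (179), including any $r$-dependent scaling, is not reproduced here, and `Q`, `G`, and the dimensions are hypothetical. Under that assumption, $\mathbf{Q}$ preserves inner products, so the non-zero eigenvalues of the lifted matrix coincide with those of $\mathbf{G}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 8, 5  # hypothetical dimensions with m > d

# Reduced QR gives an m x d matrix with orthonormal columns: Q.T @ Q = I_d.
Q, _ = np.linalg.qr(rng.standard_normal((m, d)))

B = rng.standard_normal((d, d))
G = B @ B.T              # hypothetical symmetric PSD matrix
lifted = Q @ G @ Q.T     # m x m matrix of rank at most d

eig_G = np.sort(np.linalg.eigvalsh(G))[::-1]
eig_lifted = np.sort(np.linalg.eigvalsh(lifted))[::-1]

# Orthonormal eigenvectors of G stay orthonormal after applying Q.
_, phi = np.linalg.eigh(G)
gram = (Q @ phi).T @ (Q @ phi)

print(np.allclose(eig_lifted[:d], eig_G), np.allclose(gram, np.eye(d)))
```

For a symmetric positive semidefinite matrix the largest eigenvalue is also the largest singular value, which is why the last step of the proof can read the spectral norm directly off the eigenvalues.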

Appendix R Proof of Theorem 6

Similar to the proof of Theorem 3, we first extend the definition of $\bar{\mathcal{N}}(\epsilon)$ to the case $\sigma_{k}=\sigma_{k+1}$ via letting

[TABLE]

and define $\bar{\mathcal{S}}^{(t)}_{3}(\epsilon)$ as the set of $\bar{P}_{XY}$ such that the corresponding $\Upsilon$ from (52) satisfies

[TABLE]

where $\bar{\bm{\varphi}}_{i}$ are as defined in (59). Then the following result, analogous to Lemma 12 for the case $\sigma_{k}>\sigma_{k+1}$ , will be useful in our analysis.

Lemma 13.

For all $t\in(0,2)$ , we have

[TABLE]

Proof.

The proof is similar to that of Lemma 9. Using the second-order Taylor series expansion of the K-L divergence (154), the limit (184) can be characterized by the following optimization problem:

[TABLE]

Following the same argument as that for Lemma 12, the optimal solution of (185) can be obtained by solving

[TABLE]

where we have interchanged the objective function and the quadratic function in the inequality constraint, and removed the equality constraints.

In addition, similar to (160), we can rewrite the objective function of (186) as

[TABLE]

where $\bar{\mathbf{J}}_{k}(r,\bm{\Gamma},\bm{\zeta})$ is as defined in (58). As a result, the optimization problem (186) can be rewritten as (60), and thus the optimal value is $\bar{\beta}_{k}(r)$. Finally, using the same argument as that for Lemma 12, we conclude that the optimal value of (185) is $t/\bar{\beta}_{k}(r)$ and thus

[TABLE]

which implies (184). ∎

In addition, it follows from Lemma 2 and Lemma 5 that the corresponding learning error for the distribution $\bar{P}_{XY}\in\mathcal{N}(\epsilon)$ is

[TABLE]

Therefore, for any $t\in(0,1)$ , there exists an $\epsilon_{0}>0$ such that for all $\epsilon\in(0,\epsilon_{0})$ , we have

[TABLE]

Then, using arguments similar to (99)–(101), from Lemma 13 we have

[TABLE]

Finally, following the same proof as that for Theorem 4, we obtain (62).

Appendix S Proof of Corollary 2

From Lemma 10, we have $\sigma_{1}=\dots=\sigma_{d-1}>\sigma_{d}=0$ . Therefore, for all $1\leq k\leq d-1$ we have $\mathcal{I}_{k}=[d-1]$ , which further implies that

[TABLE]

Hence, from (58) we have

[TABLE]

In addition, similar to (119), we have

[TABLE]

and thus

[TABLE]

where $\hat{\mathbf{M}}(r)$ is as defined in (180). Note that since $\left\langle\hat{\mathbf{M}}(r)\bar{\bm{\varphi}}_{i},\hat{\mathbf{M}}(r)\bar{\bm{\varphi}}_{j}\right\rangle=\delta_{ij}$ , (190) demonstrates the eigen-decomposition of $\bar{\mathbf{G}}_{k}$ . Therefore, from Theorem 6 and the definition of $\bar{\beta}(r)$ , we have

[TABLE]

To prove that the inequality holds with equality, it suffices to construct $\bm{\Gamma}$ and $\bm{\zeta}$ such that the corresponding $\bm{\varsigma}$ as defined in (61) satisfies $\|\bm{\varsigma}\|^{2}\leq 1$ and

[TABLE]

Indeed, as we now illustrate, if $\bm{\Gamma}$ and $\bm{\zeta}$ are chosen as

[TABLE]

with $\bm{\phi}^{\prime}$ as defined in (135), then we have $\bar{\bm{\varphi}}_{1}=\pm\bm{\phi}^{\prime}$ and $\bm{\varsigma}=\hat{\mathbf{M}}(r)\bm{\phi}^{\prime}=\pm\hat{\mathbf{M}}(r)\bar{\bm{\varphi}}_{1}$ , and thus (191) holds.

To see this, first note that from (173) we have

[TABLE]

and it follows from (61) and (192) that $\bm{\varsigma}=\hat{\mathbf{M}}(r)\bm{\phi}^{\prime}$ . Therefore, we have $\|\bm{\varsigma}\|^{2}=\|\bm{\phi}^{\prime}\|^{2}=1$ .

In addition, from (52) we have

[TABLE]

i.e.,

[TABLE]

Then, similar to (142), from (51) we obtain

[TABLE]

with $\Xi(y,x)$ as given by (142). Furthermore, following the same proof as that for Corollary 1, $\bar{\bm{\varphi}}_{1}$ is the solution of the optimization problem

[TABLE]

which has the same solution as the optimization problem (143) since $\bar{\bm{\Xi}}=\bm{\Xi}/\sqrt{1+r}$. Hence, we obtain $\bar{\bm{\varphi}}_{1}=\pm\bm{\phi}^{\prime}$, which finishes the proof.

Bibliography

[1] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[2] H. O. Hirschfeld, “A connection between correlation and contingency,” Proc. Cambridge Phil. Soc., vol. 31, pp. 520–524, 1935.
[3] H. Gebelein, “Das statistische problem der korrelation als variations-und eigenwertproblem und sein zusammenhang mit der ausgleichungsrechnung,” Z. für angewandte Math., Mech., vol. 21, pp. 364–379, 1941.
[4] A. Rényi, “On measures of dependence,” Acta Mathematica Academiae Scientiarum Hungarica, vol. 10, no. 3–4, pp. 441–451, 1959.
[5] C. Bell, “Mutual information and maximal correlation as measures of dependence,” The Annals of Mathematical Statistics, pp. 587–595, 1962.
[6] R. Ahlswede and P. Gács, “Spreading of sets in product spaces and hypercontraction of the Markov operator,” The annals of probability, pp. 925–939, 1976.
[7] D. Lopez-Paz, P. Hennig, and B. Schölkopf, “The randomized dependence coefficient,” in Advances in neural information processing systems, 2013, pp. 1–9.
[8] A. Makur, F. Kozynski, S.-L. Huang, and L. Zheng, “An efficient algorithm for information decomposition and extraction,” in Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on. IEEE, 2015, pp. 972–979.
