A multivariate Riesz basis of ReLU neural networks

Cornelia Schneider; Jan Vybíral

arXiv:2303.00076·cs.IT·March 2, 2023


TL;DR

This paper proves that a system of piecewise linear functions forms a Riesz basis of the multivariate space $L_2([0,1]^d)$. Its elements can be represented by ReLU neural networks, and its Riesz constants are independent of the dimension $d$.

Contribution

It provides an alternative proof of the Riesz basis property and generalizes the system to higher dimensions without tensor products, facilitating neural network representations.

Findings

01

The system forms a Riesz basis of $L_2([0,1])$.

02

The basis generalizes to higher dimensions $d>1$.

03

Riesz constants are independent of the dimension $d$.

Abstract

We consider the trigonometric-like system of piecewise linear functions introduced recently by Daubechies, DeVore, Foucart, Hanin, and Petrova. We provide an alternative proof that this system forms a Riesz basis of $L_{2}([0,1])$ based on the Gershgorin theorem. We also generalize this system to higher dimensions $d>1$ by a construction, which avoids using (tensor) products. As a consequence, the functions from the new Riesz basis of $L_{2}([0,1]^{d})$ can be easily represented by neural networks. Moreover, the Riesz constants of this system are independent of $d$, making it an attractive building block regarding future multivariate analysis of neural networks.


Equations

$$A\sum_{n}\alpha_{n}^{2}\leq\Big\|\sum_{n}\alpha_{n}x_{n}\Big\|^{2}\leq B\sum_{n}\alpha_{n}^{2}$$

$$\mathcal{C}(x)=4\Big|x-\frac{1}{2}\Big|-1=\begin{cases}1-4x,&x\in[0,1/2),\\4x-3,&x\in[1/2,1]\end{cases}$$

$$\mathcal{S}(x)=\Big|2-4\Big|x-\frac{1}{4}\Big|\Big|-1=\begin{cases}4x,&x\in[0,1/4),\\2-4x,&x\in[1/4,3/4),\\4x-4,&x\in[3/4,1].\end{cases}$$

$$\overline{\mathcal{R}}_{1}:=\{1\}\cup\{\sqrt{3}\,\mathcal{C}_{k},\sqrt{3}\,\mathcal{S}_{k}:k\in\mathbb{N}\}$$

$$\operatorname{ReLU}(x_{1},\dots,x_{n})=(\operatorname{ReLU}(x_{1}),\dots,\operatorname{ReLU}(x_{n})),\quad x\in\mathbb{R}^{n}.$$

$$\{1\}\cup\{\sqrt{3}\,{\mathcal{C}}(\alpha\cdot x):\alpha\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$+$}}\makebox[7.7778pt]{$>$}}}0\}\cup\{\sqrt{3}\,\mathcal{S}(\alpha\cdot x):\alpha\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$+$}}\makebox[7.7778pt]{$>$}}}0\},$$

$$\{\mathcal{C}_{k},\mathcal{S}_{k}:k\in\mathbb{N}\},$$

$$3\cdot\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle=3\cdot|\langle\mathcal{S}_{i},\mathcal{S}_{j}\rangle|=\frac{\gcd(i,j)^{4}}{i^{2}\cdot j^{2}}.$$

$$c_{k}(x)=\sqrt{2}\cos(2\pi kx),\quad s_{k}(x)=\sqrt{2}\sin(2\pi kx),\quad k\in\mathbb{N}_{0},\ x\in\mathbb{R}.$$

$$\sqrt{3}\,\mathcal{C}_{k}=\mu\sum_{m\geq 0}\frac{1}{(2m+1)^{2}}c_{(2m+1)k}\quad\text{and}\quad\sqrt{3}\,\mathcal{S}_{k}=\mu\sum_{m\geq 0}\frac{(-1)^{m}}{(2m+1)^{2}}s_{(2m+1)k},$$

$$\mu^{2}\sum_{m\geq 0}\frac{1}{(2m+1)^{4}}=1,\quad\text{i.e.}\quad\mu^{2}\frac{\pi^{4}}{96}=1.$$

$$3\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle=\sum_{m,n=0}^{\infty}\frac{\mu^{2}}{(2m+1)^{2}(2n+1)^{2}}\,\delta_{(2m+1)i,(2n+1)j},$$

$$(2m+1)\cdot g\cdot\frac{i}{g}=(2n+1)\cdot g\cdot\frac{j}{g}.$$

$$2m+1=\frac{j}{g}\cdot(2l+1),\quad 2n+1=\frac{i}{g}\cdot(2l+1),\quad l\in\mathbb{N}_{0}.$$

$$3\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle=\mu^{2}\sum_{l=0}^{\infty}\frac{1}{(2l+1)^{4}}\cdot\frac{1}{(i/g)^{2}\cdot(j/g)^{2}}=\frac{1}{(i/g)^{2}\cdot(j/g)^{2}}.$$

$$3\langle\mathcal{S}_{i},\mathcal{S}_{j}\rangle=\sum_{m,n=0}^{\infty}\frac{\mu^{2}(-1)^{n+m}}{(2m+1)^{2}(2n+1)^{2}}\,\delta_{(2m+1)i,(2n+1)j},$$

$$\Big\|\sum_{i=1}^{N}\alpha_{i}x_{i}\Big\|^{2}=\sum_{i,j=1}^{N}\alpha_{i}\alpha_{j}\langle x_{i},x_{j}\rangle=\alpha^{T}G\alpha,$$

$$\sigma(G)\subset\bigcup_{i=1}^{N}\Big[g_{i,i}-\sum_{j\neq i}|g_{i,j}|,\ g_{i,i}+\sum_{j\neq i}|g_{i,j}|\Big].$$

$$G_{N}=\left(\begin{array}{c|ccc|ccc}1&&0&&&0&\\\hline&3\langle\mathcal{C}_{1},\mathcal{C}_{1}\rangle=1&\dots&3\langle\mathcal{C}_{1},\mathcal{C}_{N}\rangle&&&\\0&\vdots&\ddots&\vdots&&0&\\&3\langle\mathcal{C}_{N},\mathcal{C}_{1}\rangle&\dots&3\langle\mathcal{C}_{N},\mathcal{C}_{N}\rangle=1&&&\\\hline&&&&3\langle\mathcal{S}_{1},\mathcal{S}_{1}\rangle=1&\dots&3\langle\mathcal{S}_{1},\mathcal{S}_{N}\rangle\\0&&0&&\vdots&\ddots&\vdots\\&&&&3\langle\mathcal{S}_{N},\mathcal{S}_{1}\rangle&\dots&3\langle\mathcal{S}_{N},\mathcal{S}_{N}\rangle=1\end{array}\right).$$

$$i=q_{1}^{\alpha_{1}}\cdots q_{n}^{\alpha_{n}}$$

$$\langle\sqrt{3}\,\mathcal{C}_{i},1\rangle+\sum_{j=1}^{N}3\cdot\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle+\sum_{j=1}^{N}3\cdot\langle\mathcal{C}_{i},\mathcal{S}_{j}\rangle=\sum_{\substack{1\leq j\leq N\\ j\text{ odd}}}3\cdot\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle\leq\sum_{\substack{j\in\mathbb{N}\\ j\text{ odd}}}3\cdot\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle.$$

$$\gcd(i,j)=\prod_{u=1}^{n}q_{u}^{\min(\alpha_{u},\beta_{u})}.$$

$$\sum_{\substack{j\in\mathbb{N}\\ j\text{ odd}}}3\cdot\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle=\sum_{\substack{J\in\mathbb{N},\,J\text{ odd}\\ \gcd(J,i)=1}}\frac{1}{J^{2}}\cdot\sum_{\beta_{1}=0}^{\infty}\frac{1}{[q_{1}^{\alpha_{1}+\beta_{1}-2\min(\alpha_{1},\beta_{1})}]^{2}}\cdots\sum_{\beta_{n}=0}^{\infty}\frac{1}{[q_{n}^{\alpha_{n}+\beta_{n}-2\min(\alpha_{n},\beta_{n})}]^{2}}.$$

$$\sum_{\substack{J\in\mathbb{N},\,J\text{ odd}\\ \gcd(J,i)=1}}\frac{1}{J^{2}}$$

$$\sum_{\beta=0}^{\infty}\frac{1}{[q^{\alpha+\beta-2\min(\alpha,\beta)}]^{2}}\leq\sum_{r=0}^{\infty}\frac{1}{q^{2r}}+\sum_{r=1}^{\infty}\frac{1}{q^{2r}}=(1+1/q^{2})\cdot\frac{1}{1-1/q^{2}}.$$

$$\prod_{p\text{ prime}}\frac{1+1/p^{2}}{1-1/p^{2}}=\frac{5}{2}.$$

$$\sum_{j\neq i}|(G_{N})_{i,j}|=\sum_{j\neq i}3\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle=\sum_{j=1}^{N}3\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle-3\langle\mathcal{C}_{i},\mathcal{C}_{i}\rangle\leq\frac{3}{2}-1=\frac{1}{2}.$$


Taxonomy

Topics: Model Reduction and Neural Networks · Computational Physics and Python Applications · Neural Networks and Applications

Full text

A multivariate Riesz basis of ReLU neural networks

Cornelia Schneider¹ and Jan Vybíral²

¹ Friedrich-Alexander Universität Erlangen, Applied Mathematics III, Cauerstr. 11, 91058 Erlangen, Germany. Email: [email protected]

² Department of Mathematics, Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Trojanova 13, 12000 Praha, Czech Republic. Email: [email protected]. The work of this author has been supported by the grant P202/23/04720S of the Grant Agency of the Czech Republic.


Key Words: Riesz basis, Rectified Linear Unit (ReLU), artificial neural networks, Euler product, Möbius function

MSC2020 Math Subject Classifications: 68T07, 42C15, 11A25.

1 Introduction

The last decades have seen a tremendous success of artificial neural networks in many machine learning tasks, including computer vision [16], speech recognition [11], natural language processing [24], and game playing [17, 21], to name just a few. Despite their wide use, many of their properties are not fully understood and many aspects of their great practical performance lack a rigorous explanation. Without any doubt, a deeper insight into the theory of artificial neural networks could boost their applicability even further.

In the last decade, a growing number of authors investigated why the deep neural networks with a higher number of hidden layers approximate many interesting functions more efficiently than the shallow neural networks (with only one hidden layer) using the same number of parameters. We refer to [1, 7, 8, 10, 12, 18, 22, 25] for a number of mathematically rigorous results in this direction.

One of the astonishing properties of artificial neural networks is that they can also approximate functions of many variables extremely well, which often makes it possible to avoid the curse of dimensionality [3, 9, 19]. The aim of this work is to shed new light on the effectiveness of artificial neural networks for the approximation of multivariate functions by constructing a new system of functions, which forms a Riesz basis of $L_{2}([0,1]^{d})$ for every $d\geq 1$.

To state the result, we first recall the notion of a Riesz basis (which, in turn, is a generalization of an orthonormal system and of an orthonormal basis, cf. [6]).

Definition 1.1.

Let $H$ be a real Hilbert space. The (finite or infinite) sequence $(x_{n})_{n}\subset H$ is called a Riesz sequence if there are two constants $A,B>0$ such that

$$A\sum_{n}\alpha_{n}^{2}\leq\Big\|\sum_{n}\alpha_{n}x_{n}\Big\|^{2}\leq B\sum_{n}\alpha_{n}^{2}\tag{1}$$

for every real square summable sequence $(\alpha_{n})_{n}$. If the closed span of $(x_{n})_{n}$ is the whole space $H$, then we call it a Riesz basis.

The system, which we study in this paper, is a trigonometric-like basis, where instead of $\cos$ and $\sin$ functions we use their piecewise linear counterparts $\mathcal{C}$ and $\mathcal{S}$ , which are defined as follows (cf. Figure 1).

Definition 1.2.

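1. For $x\in[0,1]$, we define

$$\mathcal{C}(x)=4\Big|x-\frac{1}{2}\Big|-1=\begin{cases}1-4x,&x\in[0,1/2),\\4x-3,&x\in[1/2,1]\end{cases}$$

and

$$\mathcal{S}(x)=\Big|2-4\Big|x-\frac{1}{4}\Big|\Big|-1=\begin{cases}4x,&x\in[0,1/4),\\2-4x,&x\in[1/4,3/4),\\4x-4,&x\in[3/4,1].\end{cases}$$

2. For $x\in\mathbb{R}$, we extend this definition periodically, i.e. $\mathcal{C}(x)=\mathcal{C}(x-\lfloor x\rfloor)$ and $\mathcal{S}(x)=\mathcal{S}(x-\lfloor x\rfloor).$

3. If $k\geq 1$ and $x\in\mathbb{R}$, we put $\mathcal{C}_{k}(x)=\mathcal{C}(kx)$ and $\mathcal{S}_{k}(x)=\mathcal{S}(kx).$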
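As a minimal sketch, Definition 1.2 can be transcribed into NumPy using the closed forms $\mathcal{C}(x)=4|x-1/2|-1$ and $\mathcal{S}(x)=|2-4|x-1/4||-1$ on $[0,1]$, extended periodically via the fractional part. The function names `C`, `S`, `C_k` and `S_k` are chosen here for illustration and do not come from the paper.

```python
import numpy as np

def C(x):
    """Piecewise linear counterpart of cos: 4|t - 1/2| - 1, t = fractional part of x."""
    x = np.asarray(x, dtype=float)
    t = x - np.floor(x)              # periodic extension to the real line
    return 4.0 * np.abs(t - 0.5) - 1.0

def S(x):
    """Piecewise linear counterpart of sin: |2 - 4|t - 1/4|| - 1, t = fractional part of x."""
    x = np.asarray(x, dtype=float)
    t = x - np.floor(x)
    return np.abs(2.0 - 4.0 * np.abs(t - 0.25)) - 1.0

def C_k(k, x):
    """Dilated function C_k(x) = C(k x)."""
    return C(k * np.asarray(x, dtype=float))

def S_k(k, x):
    """Dilated function S_k(x) = S(k x)."""
    return S(k * np.asarray(x, dtype=float))
```

Evaluating `C` and `S` at $0, 1/4, 1/2, 3/4, 1$ gives the values $1,0,-1,0,1$ and $0,1,0,-1,0$, mirroring the behaviour of $\cos(2\pi x)$ and $\sin(2\pi x)$ at the same points.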

These functions were introduced and studied in [7], where it was shown that the system

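$$\overline{\mathcal{R}}_{1}:=\{1\}\cup\{\sqrt{3}\,\mathcal{C}_{k},\sqrt{3}\,\mathcal{S}_{k}:k\in\mathbb{N}\}\tag{2}$$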

forms a Riesz basis of $L_{2}([0,1])$ with the constants $A=1/2$ and $B=3/2.$ Let us note that the factor $\sqrt{3}$ in (2) is simply a normalization factor, which ensures that all the elements of ${\overline{\mathcal{R}}}_{1}$ have unit norm in $L_{2}([0,1])$. The proof given in [7] is implicitly inspired by the method of analysis and synthesis operators used in frame theory, cf. [13]. It is one of the aims of our paper (cf. Theorem 2.2) to provide an alternative proof, which first reduces (1) to the study of spectral properties of the Gram matrix of ${\overline{\mathcal{R}}}_{1}$. The result then follows from the Gershgorin circle theorem and some elementary number theory (including Euler products and a formula of Ramanujan).

The main advantage of (2) over the standard trigonometric system is that its elements can be easily represented by artificial neural networks with the Rectified Linear Unit (ReLU) activation function.

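Let us recall that if $t\in\mathbb{R}$, then the ReLU function is defined as $\operatorname{ReLU}(t)=\max(0,t)$, cf. Figure 2. On vectors, it acts component-wise:

$$\operatorname{ReLU}(x_{1},\dots,x_{n})=(\operatorname{ReLU}(x_{1}),\dots,\operatorname{ReLU}(x_{n})),\quad x\in\mathbb{R}^{n}.$$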
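To illustrate why such piecewise linear functions suit ReLU networks, the sketch below writes $\mathcal{C}$ on $[0,1]$ with two hidden ReLU neurons, using the identity $|t|=\operatorname{ReLU}(t)+\operatorname{ReLU}(-t)$. This is only an illustration for $k=1$ under that identity; it is not the construction of [7] for general $\mathcal{C}_{j}$, and the name `C_via_relu` is hypothetical.

```python
import numpy as np

def relu(t):
    """Component-wise ReLU(t) = max(0, t)."""
    return np.maximum(0.0, t)

def C_via_relu(x):
    """C(x) = 4|x - 1/2| - 1 on [0, 1], written as a one-hidden-layer ReLU network."""
    x = np.asarray(x, dtype=float)
    # Hidden layer with two neurons: ReLU(x - 1/2) and ReLU(1/2 - x).
    hidden = relu(np.stack([x - 0.5, 0.5 - x]))
    # Affine output layer: 4 * (sum of hidden neurons) - 1.
    return 4.0 * hidden.sum(axis=0) - 1.0
```

On $[0,1]$ this agrees with $4|x-1/2|-1$; outside $[0,1]$ it does not reproduce the periodic extension, which is why the statement is restricted to the unit interval.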

If $W\geq 2$ and $L\geq 1$ are integer parameters, then we denote by $\Upsilon^{W,L}$ the real-valued functions which can be represented by a $\operatorname{ReLU}$ neural network of width $W$ and depth $L$ , see Definition 4.1 for a precise formulation. Then [7, Theorem 6.2] shows that $\mathcal{C}_{j}$ and $\mathcal{S}_{j}$ (restricted to $[0,1]$ ) lie in $\Upsilon^{W,L}$ for an arbitrary $W\geq 6$ and $L$ of the asymptotic order $\log_{2}(j).$

The main aim of our work is to generalize the results of [7] to the multivariate case. There are two crucial issues which prevent us from simply taking the tensor products of the functions in ${\overline{\mathcal{R}}}_{1}$. First, the product function $(x,y)\to x\cdot y$ can only be approximated, but not exactly represented, by $\operatorname{ReLU}$ neural networks and, second, the ratio of the Riesz constants $B$ and $A$ gets exponentially large when $d$ grows.

We propose a surprisingly simple and effective solution to these challenges. We show that (cf. Theorem 3.3) the multivariate analogue of ${\overline{\mathcal{R}}}_{1}$

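$$\{1\}\cup\{\sqrt{3}\,{\mathcal{C}}(\alpha\cdot x):\alpha\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$+$}}\makebox[7.7778pt]{$>$}}}0\}\cup\{\sqrt{3}\,\mathcal{S}(\alpha\cdot x):\alpha\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$+$}}\makebox[7.7778pt]{$>$}}}0\},\tag{3}$$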

forms a Riesz basis of $L_{2}([0,1]^{d})$ for every $d\geq 1$ with the same constants $A=1/2$ and $B=3/2$ as in the univariate case. Here, $\alpha\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$ + $}}\makebox[7.7778pt]{$ > $}}}0$ means that the first non-zero entry of $\alpha=(\alpha_{1},\dots,\alpha_{d})\in\mathbb{Z}^{d}$ is positive. Finally, we show in Section 4 that the functions from (3) can be exactly reproduced by the $\operatorname{ReLU}$ neural networks of width $W$ and depth $L$ in essentially the same way as in the univariate case, i.e., there is virtually no price to pay when $d$ grows.

2 Univariate case

It was observed in [7, Section 6], that the system of piecewise linear functions

$$\{\mathcal{C}_{k},\mathcal{S}_{k}:k\in\mathbb{N}\},\tag{4}$$

on one hand shares some nice properties with the trigonometric system and on the other hand can be easily reproduced by artificial neural networks with the $\operatorname{ReLU}$ activation function. The aim of this section is to essentially reprove Proposition 6.1 of [7], which states that this system is a Riesz basis of $L_{2}^{0}([0,1])$ , the space of square integrable functions with mean zero.

Although our proof shares some technical details with [7], its main structure is different: In particular, we reduce the problem to spectral properties of the corresponding Gram matrix and then apply the Gershgorin circle theorem. Interestingly, by using some elementary number theory, we are able to completely characterize the inner products of the $\mathcal{C}_{i}$ and/or $\mathcal{S}_{j}$ functions. In contrast to [7], we complement (4) by adding the constant function, which is orthogonal to all functions from (4).

We denote by $\gcd(i,j)$ the greatest common divisor of $i$ and $j$ and by $\langle f,g\rangle=\int_{0}^{1}f(t)g(t)dt$ the standard inner product in $L_{2}([0,1]).$

Lemma 2.1.

Let $i,j\in\mathbb{N}.$ Then

1. $\langle\mathcal{C}_{i},\mathcal{S}_{j}\rangle=0$;

2. $\langle{\mathcal{C}}_{i},{\mathcal{C}}_{j}\rangle=\langle\mathcal{S}_{i},\mathcal{S}_{j}\rangle=0$ if $i/\gcd(i,j)$ is odd and $j/\gcd(i,j)$ is even (or vice versa), i.e., if the prime factorizations of $i$ and $j$ contain a different power of 2;

3. If $i/\gcd(i,j)$ and $j/\gcd(i,j)$ are both odd, then

$$3\cdot\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle=3\cdot|\langle\mathcal{S}_{i},\mathcal{S}_{j}\rangle|=\frac{\gcd(i,j)^{4}}{i^{2}\cdot j^{2}}.$$

Here, the sign of $\langle{\mathcal{S}}_{i},{\mathcal{S}}_{j}\rangle$ is negative if, and only if, $(i+j)/(2\gcd(i,j))$ is even.

4. In particular, we get $\langle{\mathcal{C}}_{i},{\mathcal{C}}_{i}\rangle=\langle\mathcal{S}_{i},\mathcal{S}_{i}\rangle=1/3$ for all $i\in\mathbb{N}$.
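The statements of Lemma 2.1 can be checked numerically. The sketch below compares a midpoint-rule approximation of the inner products on $[0,1]$ with the closed form $\gcd(i,j)^{4}/(3i^{2}j^{2})$ and the sign rule for $\langle\mathcal{S}_{i},\mathcal{S}_{j}\rangle$. The helper names (`inner`, `predicted_CC`, `predicted_SS`) and the grid size are assumptions made for this illustration, and $\mathcal{S}$ is implemented through the relation $\mathcal{S}(x)=\mathcal{C}(x-1/4)$.

```python
import math
import numpy as np

def C(x):
    x = np.asarray(x, dtype=float)
    t = x - np.floor(x)
    return 4.0 * np.abs(t - 0.5) - 1.0

def S(x):
    # Uses the relation S(x) = C(x - 1/4).
    return C(np.asarray(x, dtype=float) - 0.25)

def inner(f, g, n=200_000):
    """Midpoint-rule approximation of the L2([0,1]) inner product of f and g."""
    x = (np.arange(n) + 0.5) / n
    return float(np.mean(f(x) * g(x)))

def predicted_CC(i, j):
    """Closed form for <C_i, C_j> from Lemma 2.1."""
    g = math.gcd(i, j)
    if (i // g) % 2 == 0 or (j // g) % 2 == 0:
        return 0.0                      # different powers of 2 in i and j
    return g**4 / (3.0 * i**2 * j**2)

def predicted_SS(i, j):
    """Closed form for <S_i, S_j>: same modulus, negative iff (i+j)/(2 gcd) is even."""
    value = predicted_CC(i, j)
    g = math.gcd(i, j)
    if value != 0.0 and ((i + j) // (2 * g)) % 2 == 0:
        return -value
    return value
```

For instance, `predicted_CC(2, 6)` evaluates $2^{4}/(3\cdot 4\cdot 36)=1/27$, while `predicted_CC(2, 3)` is zero because $2$ and $3$ carry different powers of $2$.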

Proof.

We transfer the proof to the Fourier side by exploiting the decomposition of $\mathcal{C}_{i}$ and $\mathcal{S}_{j}$ into Fourier series, cf. [7, page 166]. Let

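$$c_{k}(x)=\sqrt{2}\cos(2\pi kx),\quad s_{k}(x)=\sqrt{2}\sin(2\pi kx),\quad k\in\mathbb{N}_{0},\ x\in\mathbb{R}.$$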

Then a standard calculation reveals that

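$$\sqrt{3}\,\mathcal{C}_{k}=\mu\sum_{m\geq 0}\frac{1}{(2m+1)^{2}}c_{(2m+1)k}\quad\text{and}\quad\sqrt{3}\,\mathcal{S}_{k}=\mu\sum_{m\geq 0}\frac{(-1)^{m}}{(2m+1)^{2}}s_{(2m+1)k},\tag{5}$$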

where

$$\mu^{2}\sum_{m\geq 0}\frac{1}{(2m+1)^{4}}=1,\quad\text{i.e.}\quad\mu^{2}\frac{\pi^{4}}{96}=1.\tag{6}$$

Using (5), we immediately obtain that $\langle\mathcal{C}_{i},\mathcal{S}_{j}\rangle=0$ . Furthermore,

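$$3\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle=\sum_{m,n=0}^{\infty}\frac{\mu^{2}}{(2m+1)^{2}(2n+1)^{2}}\,\delta_{(2m+1)i,(2n+1)j},\tag{7}$$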

where $\delta_{u,v}=1$ if $u=v$ and zero otherwise. To simplify (7), we have to find for fixed $i,j\in\mathbb{N}$ all $m,n\in\mathbb{N}_{0}$ such that $(2m+1)i=(2n+1)j$ . First, we observe that if the prime factorizations of $i$ and $j$ contain a different power of two, then also $(2m+1)i$ and $(2n+1)j$ have a different power of two in their prime factorizations and therefore they differ for all $m,n\in\mathbb{N}_{0}$ . Consequently, (7) shows that $3\langle{{\mathcal{C}}_{i}},{{\mathcal{C}}_{j}}\rangle=0$ .

If the prime factorizations of $i$ and $j$ contain the same power of two, then $i/\gcd(i,j)$ and $j/\gcd(i,j)$ are both odd. We denote $g=\gcd(i,j)$ and note that $i/g$ and $j/g$ are coprime, i.e., that their greatest common divisor is one. We then look for all pairs $(m,n)\in\mathbb{N}^{2}_{0}$ , which solve the equation

$$(2m+1)\cdot g\cdot\frac{i}{g}=(2n+1)\cdot g\cdot\frac{j}{g}.$$

All the solutions are obtained in the form

$$2m+1=\frac{j}{g}\cdot(2l+1),\quad 2n+1=\frac{i}{g}\cdot(2l+1),\quad l\in\mathbb{N}_{0}.\tag{8}$$

We insert (8) into (7) and conclude that

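$$3\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle=\mu^{2}\sum_{l=0}^{\infty}\frac{1}{(2l+1)^{4}}\cdot\frac{1}{(i/g)^{2}\cdot(j/g)^{2}}=\frac{1}{(i/g)^{2}\cdot(j/g)^{2}}.$$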

The calculation of $\langle\mathcal{S}_{i},\mathcal{S}_{j}\rangle$ can be performed in a very similar way, one only needs to take care about the sign of the inner product. In particular, instead of (7) we obtain

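$$3\langle\mathcal{S}_{i},\mathcal{S}_{j}\rangle=\sum_{m,n=0}^{\infty}\frac{\mu^{2}(-1)^{n+m}}{(2m+1)^{2}(2n+1)^{2}}\,\delta_{(2m+1)i,(2n+1)j},$$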

where $(-1)^{n+m}=-1$ if, and only if, $n+m=\frac{i+j}{2g}(2l+1)-1$ is odd. Therefore, the sign is negative if, and only if, $\frac{i+j}{2g}$ is even. ∎
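As a worked instance of the argument above, consider $i=1$ and $j=3$, so that $g=\gcd(1,3)=1$ and both $i/g$ and $j/g$ are odd. The following lines only specialize the steps of the proof to this pair.

```latex
% Solutions of (2m+1) * 1 = (2n+1) * 3 with m, n >= 0:
2m+1 = 3(2l+1), \qquad 2n+1 = 2l+1, \qquad l \in \mathbb{N}_0 .
% Inserting into the double sum and using \mu^2 \sum_{l \ge 0} (2l+1)^{-4} = 1:
3\langle \mathcal{C}_1, \mathcal{C}_3\rangle
  = \mu^2 \sum_{l=0}^{\infty} \frac{1}{9\,(2l+1)^4}
  = \frac{1}{9}
  = \frac{\gcd(1,3)^4}{1^2 \cdot 3^2} .
% Sign for the S-functions: m = 3l+1 and n = l, so m + n = 4l + 1 is odd.
3\langle \mathcal{S}_1, \mathcal{S}_3\rangle = -\frac{1}{9},
\qquad \text{in agreement with } \frac{i+j}{2g} = 2 \text{ being even.}
```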

Next, we combine Lemma 2.1 with the Gershgorin circle theorem and provide an alternative proof of [7, Theorem 6.2], which we restate as follows.

Theorem 2.2.

The system ${\mathcal{R}}_{1}:=\{1\}\cup\{\mathcal{C}_{k},\mathcal{S}_{k}:k\in\mathbb{N}\}$ is a Riesz basis of $L_{2}([0,1])$ .

Before we come to the proof, several remarks seem to be in order.

Remark 2.3.

1. We will actually show that the Riesz constants of the $L_{2}$-normalized system ${\overline{\mathcal{R}}}_{1}:=\{1\}\cup\{\sqrt{3}\,\mathcal{C}_{k},\sqrt{3}\,\mathcal{S}_{k}:k\in\mathbb{N}\}$ can be chosen as $A=1/2$ and $B=3/2.$

2. We divide the proof of Theorem 2.2 into several steps. In the first two steps, we show that the truncated system ${\overline{\mathcal{R}}}_{1}^{N}:=\{1\}\cup\{\sqrt{3}\,\mathcal{C}_{k},\sqrt{3}\,\mathcal{S}_{k}:k\leq N\}$ forms a Riesz sequence (i.e., that it satisfies (1)) with $A=1/2$ and $B=3/2$ being independent of $N\in\mathbb{N}$. The first step reduces this question to spectral properties of the Gram matrix of ${\overline{\mathcal{R}}}_{1}^{N}$ and in the second step we apply the Gershgorin theorem to bound this spectrum. The third step describes how we pass to the limit $N\to\infty$ to deduce that ${\overline{\mathcal{R}}}_{1}$ is also a Riesz sequence. Finally, the fourth step shows that ${\overline{\mathcal{R}}}_{1}$ is also a basis, i.e., that its closed linear span is $L_{2}([0,1])$.

3. Our proof shows that the spectrum of the Gram matrix of ${\overline{\mathcal{R}}}_{1}^{N}$ (for arbitrary $N\in\mathbb{N}$) is contained in $[1/2,3/2].$ We leave it as an open problem to find out if these bounds are actually optimal. Supported by numerical evidence (see Figure 3 for details), our conjecture is that there is indeed some space for improvement.

Proof of Theorem 2.2: Step 1.

First we reformulate the definition of a (finite) Riesz sequence as an eigenvalue problem of its Gram matrix. This reformulation is rather straightforward and by no means new, see [15] or [26, Chapter 1.8]. Let $H$ be a real Hilbert space and let $\{x_{i}\}_{i=1}^{N}\subset H$ . Then, for every $\alpha=(\alpha_{1},\dots,\alpha_{N})^{T}\in\mathbb{R}^{N}$

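$$\Big\|\sum_{i=1}^{N}\alpha_{i}x_{i}\Big\|^{2}=\sum_{i,j=1}^{N}\alpha_{i}\alpha_{j}\langle x_{i},x_{j}\rangle=\alpha^{T}G\alpha,$$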

where $G=(g_{i,j})_{i,j=1}^{N}$ with $g_{i,j}=\langle x_{i},x_{j}\rangle$ is the Gram matrix of $\{x_{i}\}_{i=1}^{N}$ . Therefore, (1) is equivalent to $A\alpha^{T}\alpha\leq\alpha^{T}G\alpha\leq B\alpha^{T}\alpha$ for every $\alpha\in\mathbb{R}^{N}$ or simply to $\sigma(G)\subset[A,B].$ To show that this is indeed true for a given Gram matrix $G$ , we will use the Gershgorin circle theorem [14, Theorem 6.1.1], which states that

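$$\sigma(G)\subset\bigcup_{i=1}^{N}\Big[g_{i,i}-\sum_{j\neq i}|g_{i,j}|,\ g_{i,i}+\sum_{j\neq i}|g_{i,j}|\Big].$$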

Proof of Theorem 2.2: Step 2.

In this step, we show that $\{1\}\cup\{\sqrt{3}\,\mathcal{C}_{k},\sqrt{3}\,\mathcal{S}_{k}:k\leq N\}$ forms a Riesz sequence for every $N\in\mathbb{N}$ with the Riesz constants independent of $N$. By Lemma 2.1, its Gram matrix $G_{N}\in\mathbb{R}^{(2N+1)\times(2N+1)}$ is a block matrix with three blocks. The first one is just a $1\times 1$ block corresponding to the constant function, the second and the third block are $N\times N$ matrices of the inner products $\big(3\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle\big)_{i,j=1}^{N}$ and $\big(3\langle\mathcal{S}_{i},\mathcal{S}_{j}\rangle\big)_{i,j=1}^{N}$, respectively, i.e.,

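$$G_{N}=\left(\begin{array}{c|ccc|ccc}1&&0&&&0&\\\hline&3\langle\mathcal{C}_{1},\mathcal{C}_{1}\rangle=1&\dots&3\langle\mathcal{C}_{1},\mathcal{C}_{N}\rangle&&&\\0&\vdots&\ddots&\vdots&&0&\\&3\langle\mathcal{C}_{N},\mathcal{C}_{1}\rangle&\dots&3\langle\mathcal{C}_{N},\mathcal{C}_{N}\rangle=1&&&\\\hline&&&&3\langle\mathcal{S}_{1},\mathcal{S}_{1}\rangle=1&\dots&3\langle\mathcal{S}_{1},\mathcal{S}_{N}\rangle\\0&&0&&\vdots&\ddots&\vdots\\&&&&3\langle\mathcal{S}_{N},\mathcal{S}_{1}\rangle&\dots&3\langle\mathcal{S}_{N},\mathcal{S}_{N}\rangle=1\end{array}\right).$$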

We apply the Gershgorin theorem to $G_{N}$ . Therefore, we need to estimate the row sums of $G_{N}$ . For the first row we see that $g_{1,1}=1$ and $g_{1,j}=0$ for all $j\neq 1$ .

Next we assume that $i\leq N$ is an odd number, i.e. that

$$i=q_{1}^{\alpha_{1}}\cdots q_{n}^{\alpha_{n}}$$

for some primes $q_{1},\dots,q_{n}\geq 3$ and integers $\alpha_{1},\dots,\alpha_{n}\geq 1$ . For the $(i+1)$ -th row ( $i=1,\ldots,N$ , corresponding to $\mathcal{C}_{i}$ ) we conclude

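$$\langle\sqrt{3}\,\mathcal{C}_{i},1\rangle+\sum_{j=1}^{N}3\cdot\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle+\sum_{j=1}^{N}3\cdot\langle\mathcal{C}_{i},\mathcal{S}_{j}\rangle=\sum_{\substack{1\leq j\leq N\\ j\text{ odd}}}3\cdot\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle\leq\sum_{\substack{j\in\mathbb{N}\\ j\text{ odd}}}3\cdot\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle.\tag{9}$$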

Every odd $j$ can be written as $j=q_{1}^{\beta_{1}}\dots q_{n}^{\beta_{n}}\cdot J$ , where $\beta_{1},\dots,\beta_{n}\geq 0$ and $J$ is an odd integer, not divisible by any of $q_{1},\dots,q_{n}$ , i.e., with $\gcd(J,i)=1$ . Observe that with this notation

$$\gcd(i,j)=\prod_{u=1}^{n}q_{u}^{\min(\alpha_{u},\beta_{u})}.$$

Therefore, we can use Lemma 2.1 and rewrite (9) as

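$$\sum_{\substack{j\in\mathbb{N}\\ j\text{ odd}}}3\cdot\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle=\sum_{\substack{J\in\mathbb{N},\,J\text{ odd}\\ \gcd(J,i)=1}}\frac{1}{J^{2}}\cdot\sum_{\beta_{1}=0}^{\infty}\frac{1}{[q_{1}^{\alpha_{1}+\beta_{1}-2\min(\alpha_{1},\beta_{1})}]^{2}}\cdots\sum_{\beta_{n}=0}^{\infty}\frac{1}{[q_{n}^{\alpha_{n}+\beta_{n}-2\min(\alpha_{n},\beta_{n})}]^{2}}.\tag{10}$$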

Next, we simplify the individual terms.

[TABLE]

and

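$$\sum_{\beta=0}^{\infty}\frac{1}{[q^{\alpha+\beta-2\min(\alpha,\beta)}]^{2}}\leq\sum_{r=0}^{\infty}\frac{1}{q^{2r}}+\sum_{r=1}^{\infty}\frac{1}{q^{2r}}=(1+1/q^{2})\cdot\frac{1}{1-1/q^{2}}.$$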

Therefore,

[TABLE]

where in the last step we used the following Euler product [23, Page 5] attributed already to Ramanujan

$$\prod_{p\text{ prime}}\frac{1+1/p^{2}}{1-1/p^{2}}=\frac{5}{2}.$$

Therefore, we get

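$$\sum_{j\neq i}|(G_{N})_{i,j}|=\sum_{j\neq i}3\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle=\sum_{j=1}^{N}3\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle-3\langle\mathcal{C}_{i},\mathcal{C}_{i}\rangle\leq\frac{3}{2}-1=\frac{1}{2}.$$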
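The Euler product $\prod_{p}(1+p^{-2})/(1-p^{-2})=5/2$ used in this step can be checked numerically by truncating the product at a finite prime bound. The sketch below is such a check; the function names and the cutoff are illustrative choices. Omitting the factor $5/3$ for $p=2$ gives the value $3/2$ for the product over odd primes, which matches the constant $3/2$ appearing in the row-sum estimate.

```python
def primes_up_to(n):
    """Sieve of Eratosthenes returning all primes <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n**0.5) + 1):
        if sieve[p]:
            # Mark all multiples of p starting at p*p as composite.
            sieve[p * p::p] = bytearray(len(range(p * p, n + 1, p)))
    return [p for p in range(2, n + 1) if sieve[p]]

def euler_partial_product(limit, skip_two=False):
    """Partial product of (1 + p^-2) / (1 - p^-2) over primes p <= limit."""
    product = 1.0
    for p in primes_up_to(limit):
        if skip_two and p == 2:
            continue
        product *= (1.0 + p**-2) / (1.0 - p**-2)
    return product
```

Since every factor exceeds one, the partial products increase towards the limit from below.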

If $i$ is even it follows from Lemma 2.1, assertions 2. and 3., that the estimates above remain the same since $3\cdot\langle{\mathcal{C}}_{i},{\mathcal{C}}_{j}\rangle=\frac{\gcd(i,j)^{4}}{i^{2}\cdot j^{2}}\neq 0$ only for $j$ ’s with the same power of $2$ in their prime factorization as $i$ , which then cancels out.

If we replace in (9) $\mathcal{C}_{i}$ by $\mathcal{S}_{i}$ (corresponding to the rows $(N+1)+i$ , $i=1,\ldots,N$ of the Gram matrix), we obtain instead the estimate

[TABLE]

The term on the right hand side of (12) can be bounded by (10) as before and we again obtain

[TABLE]

By Gershgorin’s theorem, we deduce $\sigma(G_{N})\subset\{1\}\cup[\frac{1}{2},\frac{3}{2}]=[\frac{1}{2},\frac{3}{2}].$
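The Gershgorin argument of Steps 1 and 2 can be reproduced numerically for a moderate $N$. The sketch below builds the block $\big(3\langle\mathcal{C}_{i},\mathcal{C}_{j}\rangle\big)_{i,j=1}^{N}$ from the closed form of Lemma 2.1, computes the Gershgorin intervals of its rows, and compares them with the eigenvalues. Only the cosine block is treated, since the sine block has entries of the same modulus; the function names and the choice $N=100$ are assumptions of this illustration.

```python
import math
import numpy as np

def cos_block(N):
    """N x N matrix with entries 3 <C_i, C_j>, taken from Lemma 2.1."""
    G = np.zeros((N, N))
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            g = math.gcd(i, j)
            # Non-zero only if i/g and j/g are both odd.
            if (i // g) % 2 == 1 and (j // g) % 2 == 1:
                G[i - 1, j - 1] = g**4 / (i**2 * j**2)
    return G

def gershgorin_intervals(G):
    """Lower and upper endpoints of the Gershgorin intervals of a real symmetric matrix."""
    centers = np.diag(G)
    radii = np.sum(np.abs(G), axis=1) - np.abs(centers)
    return centers - radii, centers + radii

N = 100
G = cos_block(N)
lower, upper = gershgorin_intervals(G)
eigenvalues = np.linalg.eigvalsh(G)   # symmetric matrix, real spectrum
```

For such a block, `lower.min()` and `upper.max()` stay inside $[1/2,3/2]$, and comparing them with `eigenvalues.min()` and `eigenvalues.max()` indicates how much room the Gershgorin bound leaves.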

Proof of Theorem 2.2: Step 3.

The third step of the proof of Theorem 2.2, i.e., the passage to the limit $N\to\infty$ , is quite standard and straightforward (cf. [15] and [4, 5] for the so-called “projection method”) and is contained in the following lemma.

Lemma 2.4.

Let $H$ be a real Hilbert space and let $(x_{n})_{n=1}^{\infty}\subset H$ be an infinite sequence. If $\{x_{1},\dots,x_{N}\}$ is a Riesz sequence for every $N\in\mathbb{N}$ with Riesz constants $A$ and $B$ independent of $N$, then $\{x_{1},x_{2},\dots\}$ is also a Riesz sequence with Riesz constants $A$ and $B$.

Proof.

Let $\alpha=(\alpha_{1},\alpha_{2},\dots)$ be a square-summable sequence. Since

[TABLE]

the partial sums of $\sum_{n=1}^{\infty}\alpha_{n}x_{n}$ form a Cauchy sequence and therefore the series is convergent. Furthermore, by the triangle inequality

[TABLE]

Hence, we can take the limit $N\to\infty$ in

[TABLE]

and the result follows. ∎

Proof of Theorem 2.2: Step 4.

As the last step, we show that ${\mathcal{R}}_{1}$ is not only a Riesz sequence but also a Riesz basis, i.e., that its closed linear span is the whole space $L_{2}([0,1])$ . We rely on the fact that the trigonometric system

[TABLE]

forms an orthonormal basis of $L_{2}([0,1])$ . We show that every function from (14) lies in the closed linear span of ${\mathcal{R}}_{1}$ and, therefore, the closed linear span of ${\mathcal{T}}_{1}$ is contained in the closed linear span of ${\mathcal{R}}_{1}.$

We start with the following lemma, which gives an explicit decomposition of $\cos(2\pi x)$ and $\sin(2\pi x)$ in ${\mathcal{R}}_{1}$ . Its statement requires the notion of the Möbius function, which is defined for every positive integer $n\in\mathbb{N}$ as

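$$\mu(n)=\begin{cases}1,&\text{if }n=1,\\(-1)^{k},&\text{if }n\text{ is the product of }k\text{ distinct primes},\\0,&\text{if }n\text{ is divisible by the square of a prime}.\end{cases}$$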

Lemma 2.5.

For every $l\in\mathbb{N}$ , let $\overline{\mathcal{C}}_{l}(x)=\sqrt{3}\,\mathcal{C}_{l}(x)/\mu$ and $\overline{\mathcal{S}}_{l}(x)=\sqrt{3}\,\mathcal{S}_{l}(x)/\mu$ , where $\mu$ is the constant from (6). Then

[TABLE]

and

[TABLE]

with the convergence being in $L_{2}([0,1])$ .

Proof.

We first reformulate (5) as

[TABLE]

where $\alpha_{2m+1}=1$ and $\alpha_{2m}=0$. Note that (17) converges in $L_{2}([0,1])$.

We show that there is a unique bounded sequence $(\beta_{l})_{l=1}^{\infty}$ , such that

[TABLE]

with the convergence in $L_{2}([0,1])$ and that

[TABLE]

Using (17), we observe that (18) holds for a bounded sequence $(\beta_{l})_{l=1}^{\infty}$ if, and only if,

[TABLE]

We compare the coefficients of $c_{k}(x)$ on both sides of (20) and observe that (20) is equivalent to the system of equations

[TABLE]

This system could be solved by using the Möbius inversion formula [23, p. 3], but one can also proceed directly. From $1=\beta_{1}\alpha_{1}$ we obtain $\beta_{1}=1$ and from $\beta_{1}\alpha_{2}+\beta_{2}\alpha_{1}=0$ we get $\beta_{2}=0$ . We show by induction that $\beta_{2n}=0$ also for all $n\geq 1$ . Let this be true for all integers smaller than $n$ . Then

[TABLE]

gives $\beta_{2n}=0$ as well. Similarly, from $0=\beta_{p}\alpha_{1}+\beta_{1}\alpha_{p}$ we get $\beta_{p}=-1$ for every prime $p\neq 2$ .

Let now $p=p_{1}p_{2}$ with odd primes $p_{1},p_{2}$ . Then $\beta_{p}=1$ follows from

[TABLE]

The formula for a general $p=p_{1}\cdot\ldots\cdot p_{k}$ , with distinct odd primes $p_{i}$ follows by induction. Observe that $p$ is divisible by all $p_{1}^{e_{1}}\cdot\ldots\cdot p_{k}^{e_{k}}$ with $e=(e_{1},\dots,e_{k})\in\{0,1\}^{k}$ . Hence

[TABLE]

We conclude that $\beta_{n}=\mu(n)$ for every odd square-free integer $n$.

Finally, we show that $\beta_{n}=0$ for every positive odd integer $n$ , which is not square-free. Therefore, we assume that $n=p_{1}^{u}\cdot q$ , where $p_{1}$ is an odd prime, $q$ is an odd integer not divisible by $p_{1}$ , $u\geq 2$ and that the statement is true for all integers smaller than $n$ . We obtain

[TABLE]

If $l<q$ is square-free, then the first two terms in this sum have values $+1$ and $-1$ , respectively, and the others vanish by the induction assumption. If $l<q$ is not square-free, then all the terms vanish again by assumption. Finally, if $l=q$ the same argument applies leaving us with $\beta_{n}=0$ . We conclude that the sequence $(\beta_{l})_{l=1}^{\infty}$ given by (19) indeed satisfies the system (21) which in turn gives (18).

Finally, (16) follows from (15) using the simple relation $\mathcal{S}(x)=\mathcal{C}(x-1/4).$ The factor $(-1)^{l}$ results from the relation

[TABLE]

∎
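The recursion used in the proof of Lemma 2.5 can be traced in a few lines of code. The sketch below assumes, as the relations $1=\beta_{1}\alpha_{1}$ and $\beta_{1}\alpha_{2}+\beta_{2}\alpha_{1}=0$ quoted in the proof suggest, that the system for $(\beta_{l})$ reads $\sum_{l\mid k}\beta_{l}\alpha_{k/l}=\delta_{k,1}$ with $\alpha_{k}=1$ for odd $k$ and $\alpha_{k}=0$ for even $k$. It solves this system step by step and compares the result with the Möbius function. The function names are hypothetical.

```python
def mobius(n):
    """Moebius function of a positive integer, computed by trial division."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0          # divisible by the square of a prime
            result = -result
        p += 1
    if n > 1:
        result = -result          # one remaining prime factor
    return result

def solve_beta(K):
    """Solve sum_{l | k} beta_l * alpha_{k/l} = [k == 1] for k = 1..K (assumed form of the system)."""
    alpha = [0] + [k % 2 for k in range(1, K + 1)]   # alpha_k = 1 for odd k, 0 for even k
    beta = [0] * (K + 1)
    for k in range(1, K + 1):
        rhs = 1 if k == 1 else 0
        # Contributions of proper divisors l < k; the term l = k has coefficient alpha_1 = 1.
        known = sum(beta[l] * alpha[k // l] for l in range(1, k) if k % l == 0)
        beta[k] = rhs - known
    return beta
```

Under this assumption, `solve_beta` reproduces the values derived in the proof: $\beta_{2n}=0$, $\beta_{p}=-1$ for odd primes $p$, $\beta_{p_{1}p_{2}}=1$ for distinct odd primes, and $\beta_{n}=0$ for odd $n$ that are not square-free.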

3 Multivariate case

The main aim of this section is to generalize Theorem 2.2 to higher dimensions $d\geq 1$ and to provide a Riesz basis of $L_{2}([0,1]^{d})$ , which is easily expressed by artificial neural networks with $\operatorname{ReLU}$ activation function. The most natural approach would be to consider the tensor products of the functions from ${\mathcal{R}}_{1}$ , i.e., a system of functions of the form $(x,y)\to\mathcal{C}_{k}(x)\cdot\mathcal{C}_{l}(y)$ etc. Indeed, it is quite easy to show that tensor products of elements of a Riesz sequence form again a Riesz sequence [2]. This approach is quite classical in analysis and there exist many multivariate bases and systems with a tensor product structure. Unfortunately, the Riesz constants of the tensor product system are in general given as products of the Riesz constants of the univariate Riesz sequences, cf. [2, Theorem 4.1]. Applying the tensor product construction to ${\mathcal{R}}_{1}$ would therefore lead to an exponential dependence of the ratio of the Riesz constants on the dimension.

Furthermore, the tensor product approach does not fit artificial neural networks well. The reason is that it is surprisingly difficult to construct a neural network, which for two real inputs $x$ and $y$ outputs the product $xy$ (or at least an approximation of it). In general, one first approximates the square function $t\to t^{2}$ and then applies the formula $xy=[(x+y)^{2}-(x-y)^{2}]/4.$ We refer to [10, 20, 25] for details. Therefore, we are looking for another multivariate Riesz basis of piecewise affine functions, which can be constructed without the use of (tensor) products, but where inner products with fixed vectors in $\mathbb{R}^{d}$ are allowed.

Before we state our results, we need some additional notation. If $\alpha=(\alpha_{1},\dots,\alpha_{d})\in\mathbb{Z}^{d}$ , we say that $\alpha\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$ + $}}\makebox[7.7778pt]{$ > $}}}0$ if the first non-zero entry of $\alpha$ is positive. The multivariate analogue of ${\mathcal{R}}_{1}$ is then defined simply as

[TABLE]

where we interpret the functions $\mathcal{C}$ and $\mathcal{S}$ as $1$-periodic functions on $\mathbb{R}$. Note that we need to restrict ourselves to indices $\alpha\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$ + $}}\makebox[7.7778pt]{$ > $}}}0$ here, since ${\mathcal{C}}(\alpha\cdot x)={\mathcal{C}}(-\alpha\cdot x)$ and ${\mathcal{S}}(\alpha\cdot x)=-{\mathcal{S}}(-\alpha\cdot x)$, respectively.
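The index condition (first non-zero entry of $\alpha$ positive) and the two symmetry relations that motivate it can be illustrated as follows. The sketch enumerates admissible indices in a small box and checks ${\mathcal{C}}(\alpha\cdot x)={\mathcal{C}}(-\alpha\cdot x)$ and ${\mathcal{S}}(\alpha\cdot x)=-{\mathcal{S}}(-\alpha\cdot x)$ at sample points. The names `is_admissible` and `admissible_indices` as well as the box bound are assumptions made for this example.

```python
import itertools
import numpy as np

def C(x):
    x = np.asarray(x, dtype=float)
    t = x - np.floor(x)
    return 4.0 * np.abs(t - 0.5) - 1.0

def S(x):
    # Uses the relation S(x) = C(x - 1/4).
    return C(np.asarray(x, dtype=float) - 0.25)

def is_admissible(alpha):
    """True if the first non-zero entry of the integer vector alpha is positive."""
    for a in alpha:
        if a != 0:
            return a > 0
    return False                      # the zero vector is excluded

def admissible_indices(d, bound):
    """All admissible alpha in Z^d with entries in [-bound, bound]."""
    entries = range(-bound, bound + 1)
    return [a for a in itertools.product(entries, repeat=d) if is_admissible(a)]
```

For example, `admissible_indices(2, 1)` returns `(0, 1)`, `(1, -1)`, `(1, 0)` and `(1, 1)`, i.e. exactly one representative of each pair $\pm\alpha$ among the eight non-zero vectors in $\{-1,0,1\}^{2}$.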

Furthermore, we say that two non-zero $\alpha,\beta\in\mathbb{R}^{d}$ are co-linear if there is $t\not=0$ such that $\alpha=t\beta.$ Obviously, if $\alpha,\beta\in\mathbb{Z}^{d}$ with $\alpha,\beta\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$ + $}}\makebox[7.7778pt]{$ > $}}}0$ are co-linear, then $t>0$ . In the rest of this section, the inner product $\displaystyle\langle f,g\rangle=\int_{[0,1]^{d}}f(x)g(x)dx$ denotes the inner product in $L_{2}([0,1]^{d})$ , the space of real square integrable functions on $[0,1]^{d}$ .

The multivariate analogue of Lemma 2.1, which characterizes the inner products of the elements of ${\mathcal{R}}_{d}$ then looks as follows.

Lemma 3.1.

Let $\alpha,\beta\in\mathbb{Z}^{d}$ with $\alpha,\beta\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$ + $}}\makebox[7.7778pt]{$ > $}}}0$ . Then

1. $\langle\mathcal{C}(\alpha\cdot x),\mathcal{S}(\beta\cdot x)\rangle=0$.

2. $\langle{\mathcal{C}}(\alpha\cdot x),{\mathcal{C}}(\beta\cdot x)\rangle=\langle{\mathcal{S}}(\alpha\cdot x),{\mathcal{S}}(\beta\cdot x)\rangle=0$ if $\alpha$ and $\beta$ are not co-linear, or if $\alpha=t\beta$ but $t$ cannot be written as a ratio of two odd positive integers.

3. If $\displaystyle\alpha=\frac{2p+1}{2q+1}\cdot\beta$ with coprime integers $2p+1$ and $2q+1$ (i.e., $\gcd(2p+1,2q+1)=1$), then

$$\langle{\mathcal{C}}(\alpha\cdot x),{\mathcal{C}}(\beta\cdot x)\rangle=\frac{1}{3(2p+1)^{2}(2q+1)^{2}}\tag{22}$$

and

$$\langle{\mathcal{S}}(\alpha\cdot x),{\mathcal{S}}(\beta\cdot x)\rangle=\frac{(-1)^{p+q}}{3(2p+1)^{2}(2q+1)^{2}}.\tag{23}$$

Remark 3.2.

Lemma 3.1 includes Lemma 2.1 as a special case. In particular, if $d=1$ then $\alpha$ and $\beta$ are always co-linear. Moreover, if the prime factorizations of $\alpha$ and $\beta$ contain different powers of $2$ , then we have $\alpha\neq\frac{2p+1}{2q+1}\beta$ for all integers $p,q$ .
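The three cases of Lemma 3.1 can be checked numerically for $d=2$. The sketch below uses a plain midpoint rule on a uniform grid, which is an assumption of this example and not part of the proof. The reference values $\pm 1/27$ for $\alpha=(3,3)$, $\beta=(1,1)$ (that is, $p=1$, $q=0$) are what the Fourier series of $\mathcal{C}$ and $\mathcal{S}$ give for this pair. The function name `inner` and the grid size `n` are hypothetical.

```python
import numpy as np


def C(t):
    x = np.mod(t, 1.0)
    return 4.0 * np.abs(x - 0.5) - 1.0


def S(t):
    x = np.mod(t, 1.0)
    return np.abs(2.0 - 4.0 * np.abs(x - 0.25)) - 1.0


def inner(f, alpha, g, beta, n=512):
    """Midpoint-rule approximation of <f(alpha.x), g(beta.x)> on [0, 1]^2."""
    pts = (np.arange(n) + 0.5) / n
    x, y = np.meshgrid(pts, pts, indexing="ij")
    fa = f(alpha[0] * x + alpha[1] * y)
    gb = g(beta[0] * x + beta[1] * y)
    return float(np.mean(fa * gb))


if __name__ == "__main__":
    print(inner(C, (3, 3), C, (1, 1)))   # close to  1/27 (odd ratio 3/1)
    print(inner(S, (3, 3), S, (1, 1)))   # close to -1/27 (sign (-1)^(p+q))
    print(inner(C, (1, 2), C, (2, 1)))   # close to 0 (not co-linear)
    print(inner(C, (2, 2), C, (1, 1)))   # close to 0 (ratio 2 is not odd/odd)
    print(inner(C, (1, 1), S, (1, 1)))   # close to 0 (C against S)
```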

Proof of Lemma 3.1.

First, we recall the elementary formulas

[TABLE]

and

[TABLE]

Next, we proceed to the proof of (22). Let us fix $\alpha,\beta\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$ + $}}\makebox[7.7778pt]{$ > $}}}0$ . We apply (5) followed by (24) and obtain

[TABLE]

Next, we discuss, when the last product vanishes for given $J$ . First, this happens if $\alpha_{j}=0$ or $\beta_{j}=0$ for any $j\in J$ . If $j\in J$ and both $\alpha_{j}$ and $\beta_{j}$ are non-zero, then the last product vanishes also if $(2m+1)|\alpha_{j}|\not=(2n+1)|\beta_{j}|$ . And finally, the product is zero also if $j\not\in J$ and $(2m+1)|\alpha_{j}|\not=(2n+1)|\beta_{j}|$ .

Equivalently, (26) is not equal to zero if $(2m+1)|\alpha_{j}|=(2n+1)|\beta_{j}|$ for every $j\in\{1,\dots,d\}$ and $\alpha_{j}$ and $\beta_{j}$ are different from zero if $j\in J$ . If we denote $|\alpha|=(|\alpha_{1}|,\dots,|\alpha_{d}|)$ (and similarly for $|\beta|$ ), we will therefore restrict ourselves for the rest of the proof to $m,n\geq 0$ with

[TABLE]

If there is no pair of integers $(m,n)\in\mathbb{N}_{0}^{2}$ , such that (27) holds, then $\langle{\mathcal{C}}(\alpha\cdot x),{\mathcal{C}}(\beta\cdot x)\rangle=0$ . Furthermore, we may consider only sets $J\subset\{1,\dots,d\}$ with an even number of elements, which are subsets of $\operatorname{supp}(\alpha)=\operatorname{supp}(\beta).$

If (27) holds, then the univariate integrals in (26) are equal to one for $j\not\in J$ and $j\in\operatorname{supp}(\alpha)$ . They are equal to two if $j\not\in J$ and $j\not\in\operatorname{supp}(\alpha)$ . And if $j\in J\subset\operatorname{supp}(\alpha)$ , then the integral is equal to $+1$ if $(2m+1)\alpha_{j}=(2n+1)\beta_{j}$ and to $-1$ if $(2m+1)\alpha_{j}=-(2n+1)\beta_{j}$ .

We denote $D=\{j:\operatorname{sign}(\alpha_{j})\cdot\operatorname{sign}(\beta_{j})=-1\}\subset\operatorname{supp}(\alpha)=\operatorname{supp}(\beta)$ , $\nu=\#D$ and $n=\#\operatorname{supp}(\alpha)\geq\nu$ . Using this notation, we obtain

[TABLE]

If $\nu\geq 1$ , we calculate

[TABLE]

where the last step follows since $\sum_{b=0}^{\nu}(-1)^{b}\binom{\nu}{b}=(1-1)^{\nu}=0$ . Hence, $\langle{\mathcal{C}}(\alpha\cdot x),{\mathcal{C}}(\beta\cdot x)\rangle=0$ if there exists $1\leq j\leq d$ with $\operatorname{sign}(\alpha_{j})\cdot\operatorname{sign}(\beta_{j})=-1$ and we arrive at

[TABLE]

The last sum is empty if $\alpha$ and $\beta$ are not co-linear or if we cannot write $\alpha=t\beta$ , where $t>0$ is a ratio of two odd integers. Therefore, we assume that $\displaystyle\alpha=\frac{2p+1}{2q+1}\beta$ with $p,q\in\mathbb{N}_{0}$ and that $2p+1$ and $2q+1$ are coprime integers. All pairs $(m,n)\in\mathbb{N}_{0}^{2}$ with $(2m+1)\alpha=(2n+1)\beta$ are then of the form

[TABLE]

This finally leads to

[TABLE]

which combined with (6) gives (22). As a byproduct, we also showed that $\langle{\mathcal{C}}(\alpha\cdot x),{\mathcal{C}}(\beta\cdot x)\rangle=0$ unless $\alpha$ and $\beta$ are co-linear with a real factor $t$ that can be written as a ratio of two odd positive integers.

Applying the same idea to the inner product of $\mathcal{C}(\alpha\cdot x)$ and $\mathcal{S}(\beta\cdot x)$ , we get a double sum over $J,K\subset\{1,\dots,d\}$ with $\#J$ even and $\#K$ odd. Therefore, it is not possible to match the univariate integrands and their product always vanishes. Finally, (23) follows in the same way, the only essential difference being the $(-1)^{m+n}$ factor coming from (5). And an easy observation shows that under (28), the parity of $m+n$ is the same as the one of $p+q.$ ∎

We complement Lemma 3.1 by the simple observation that the constant function is orthogonal to all other elements of ${\mathcal{R}}_{d}$ . The multivariate analogue of Theorem 2.2 then reads as follows.

Theorem 3.3.

Let $d\geq 1$ . Then the system ${\mathcal{R}}_{d}$ forms a Riesz basis of $L_{2}([0,1]^{d})$ with the Riesz constants independent of $d$ . To be more specific, the Riesz constants of the normalized system

[TABLE]

can be chosen as $A=1/2$ and $B=3/2$ independently of $d$ .

Proof.

Step 1. Using Lemma 3.1 together with the Gershgorin circle theorem, it is surprisingly simple to prove Theorem 3.3 with slightly worse constants $A$ and $B$ , cf. Remark 3.4. To improve the Riesz constants to $A=1/2$ and $B=3/2$ , we proceed more carefully. Let us denote by ${\mathbb{P}}$ the set of primes and by ${\mathbb{P}}^{\prime}={\mathbb{P}}\setminus\{2\}$ the set of odd primes. Then every odd $p\in\mathbb{N}$ can be written as $p=p_{1}^{k_{1}}\cdot\ldots\cdot p_{n}^{k_{n}}$ with $p_{1},\dots,p_{n}\in{\mathbb{P}}^{\prime}$ and $k_{1},\dots,k_{n}\in\mathbb{N}$ . If $p=1$ , then we choose $n=0$ and interpret the empty product as one.

We use the following observation. For a fixed $\alpha\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$ + $}}\makebox[7.7778pt]{$ > $}}}0$ and a pair of odd coprime integers $2p+1$ and $2q+1$ , there exists at most one $\beta\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$ + $}}\makebox[7.7778pt]{$ > $}}}0$ such that $\alpha=(2p+1)/(2q+1)\cdot\beta$ . Then we obtain

[TABLE]

where the last step follows from (10). Using the Gershgorin circle theorem in the same way as in the proof of Theorem 2.2 then gives the bounds $1/2\leq A\leq B\leq 3/2$ , independent of $d$ .
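The size of the Gershgorin radius can be made plausible numerically. The sketch assumes that, after normalization, the off-diagonal Gram entries have modulus $1/((2p+1)^{2}(2q+1)^{2})$ and that each coprime odd pair $(2p+1,2q+1)\neq(1,1)$ contributes at most once per row, as the observation above suggests. Under this assumption the row sum is bounded by a series over coprime odd pairs, whose partial sums approach $1/2$ from below. The function name `gershgorin_row_bound` and the truncation parameter `limit` are hypothetical.

```python
from math import gcd


def gershgorin_row_bound(limit):
    """Partial sum of 1/(a*b)**2 over coprime odd pairs (a, b) != (1, 1), a, b <= limit."""
    total = 0.0
    for a in range(1, limit + 1, 2):
        for b in range(1, limit + 1, 2):
            if (a, b) != (1, 1) and gcd(a, b) == 1:
                total += 1.0 / (a * b) ** 2
    return total


if __name__ == "__main__":
    for limit in (11, 101, 401):
        r = gershgorin_row_bound(limit)
        print(limit, r, 1.0 - r, 1.0 + r)   # radius, lower bound, upper bound
```

The limit $1/2$ can also be seen directly: the sum over all odd pairs equals $(\pi^{2}/8)^{2}=\pi^{4}/64$, and factoring out the common odd divisor divides this by $\sum_{k\ \mathrm{odd}}k^{-4}=\pi^{4}/96$, which gives $3/2$ including the diagonal term.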

Step 2. We show that ${\mathcal{R}}_{d}$ is also a Riesz basis. The system

[TABLE]

with $c_{k}(x)=\sqrt{2}\cos(2\pi kx)$ and $s_{k}(x)=\sqrt{2}\sin(2\pi kx)$ is an orthonormal basis of $L_{2}([0,1])$ . Therefore, all possible tensor products of the functions from (30) form an orthonormal basis of $L_{2}([0,1]^{d})$ . For this system we use the following notation

[TABLE]

Now we show that every function from (31) can be found in the closed linear span of ${\mathcal{R}}_{d}$ . This will imply the completeness of ${\mathcal{R}}_{d}.$ First, we again recall two simple formulas

[TABLE]

and, similarly,

[TABLE]

If $\#M$ is even, we use the elementary property $\cos(\alpha)\cos(\beta)=(\cos(\alpha+\beta)+\cos(\alpha-\beta))/2$ and obtain

[TABLE]

where in the last step we used the Fourier decomposition from (15). If $\#M$ is odd, we use instead the formula $\cos(\alpha)\sin(\beta)=(\sin(\alpha+\beta)+\sin(\beta-\alpha))/2$ , which yields

[TABLE]

We interpret these formulas as a decomposition of a basis function from (31) into ${\mathcal{R}}_{d}$ , which converges in $L_{2}([0,1]^{d})$ . Reasoning similarly as in the proof of Theorem 2.2, this finishes the argument. ∎

Remark 3.4.

We observe that Lemma 3.1 implies for fixed $\alpha\mathrel{\makebox[7.7778pt]{\raisebox{0.3pt}{\clipbox{.30pt{} .30pt}{$ + $}}\makebox[7.7778pt]{$ > $}}}0$

[TABLE]

Using Gershgorin’s theorem similarly as in the proof of Theorem 2.2, we could have obtained quite easily that (29) is a Riesz sequence with constants $A=2-\pi^{4}/64$ and $B=\pi^{4}/64.$
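For orientation, the constant $\pi^{4}/64$ can be traced back to an elementary computation. The sketch assumes that the off-diagonal row sum of the Gram matrix of the normalized system is bounded by the sum over all pairs $(p,q)\neq(0,0)$, without exploiting coprimality.

```latex
\sum_{p,q\ge 0}\frac{1}{(2p+1)^{2}(2q+1)^{2}}
  =\Bigl(\sum_{p\ge 0}\frac{1}{(2p+1)^{2}}\Bigr)^{2}
  =\Bigl(\frac{\pi^{2}}{8}\Bigr)^{2}
  =\frac{\pi^{4}}{64}\approx 1.522,
\qquad\text{hence}\qquad
\sum_{(p,q)\neq(0,0)}\frac{1}{(2p+1)^{2}(2q+1)^{2}}=\frac{\pi^{4}}{64}-1\approx 0.522 .
```

With diagonal entries equal to one, Gershgorin's theorem then places the spectrum in $[1-(\pi^{4}/64-1),\,1+(\pi^{4}/64-1)]=[2-\pi^{4}/64,\,\pi^{4}/64]\approx[0.478,\,1.522]$, which is only slightly worse than $[1/2,3/2]$.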

4 Neural networks

In this section we finally address the question in which classes of neural networks we can find the elements of the new Riesz basis $\mathcal{R}_{d}$ . To this end, we first fix some notation and recall what was shown in [7, Sect. 6] (in the case $d=1$ ).

A function $f:\mathbb{R}^{n_{1}}\to\mathbb{R}^{n_{2}}$ is called affine if it can be written as $f(x)=Mx+b$ , where $M\in\mathbb{R}^{n_{2}\times n_{1}}$ is a matrix and $b\in\mathbb{R}^{n_{2}}.$ The following definition formalizes the notion of $\operatorname{ReLU}$ neural networks with width $W$ and depth $L$ , cf. Figures 4 and 5.

Definition 4.1.

Let $d,W,L$ be positive integers. Then a feed-forward $\operatorname{ReLU}$ network ${\mathcal{N}}$ with width $W$ and depth $L$ is a collection of $L+1$ affine mappings $A^{(0)},\dots,A^{(L)}$ , where $A^{(0)}:\mathbb{R}^{d}\to\mathbb{R}^{W}$ , $A^{(j)}:\mathbb{R}^{W}\to\mathbb{R}^{W}$ for $j=1,\dots,L-1$ and $A^{(L)}:\mathbb{R}^{W}\to\mathbb{R}$ . Each such network ${\mathcal{N}}$ generates a function of $d$ variables

$$A^{(L)}\circ\operatorname{ReLU}\circ A^{(L-1)}\circ\dots\circ\operatorname{ReLU}\circ A^{(0)}.$$

Moreover, we denote by $\Upsilon^{W,L}$ the set of all functions, which are generated in this way by some feed-forward $\operatorname{ReLU}$ network with width $W$ and depth $L$ .

Every $S\in\Upsilon^{W,L}$ is a continuous piecewise affine function on ${\mathbb{R}}^{d}$ . If the affine mappings associated to $S$ are denoted by $A^{(l)}$ with $l=0,\ldots,L$ , then the value $S(x^{(0)})$ is computed for each input $x:=x^{(0)}\in\mathbb{R}^{d}$ after the calculation of a series of intermediate vectors $x^{(l)}:=\operatorname{ReLU}(A^{(l-1)}x^{(l-1)})\in\mathbb{R}^{W}$ , $l=1,\dots,L$ , called vectors of activation at layer $l$ . Finally, the output $S(x)$ is produced as $S(x):=x^{(L+1)}=A^{(L)}x^{(L)}$ .
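The evaluation scheme just described translates directly into code. In the sketch below a network is stored as a list of pairs `(M, b)` representing the affine maps $A^{(0)},\dots,A^{(L)}$. This storage format, the function name `evaluate`, and the toy network `abs_diff` are assumptions made for illustration only.

```python
import numpy as np


def evaluate(layers, x):
    """Evaluate a feed-forward ReLU network given as [(M0, b0), ..., (ML, bL)].

    ReLU is applied after every affine map except the last one.
    """
    z = np.asarray(x, dtype=float)
    for M, b in layers[:-1]:
        z = np.maximum(M @ z + b, 0.0)   # activation vector x^(l)
    M, b = layers[-1]
    return M @ z + b                     # output map A^(L), no ReLU


# Toy network with d = 2, W = 2, L = 1 computing |x_1 - x_2|
# as ReLU(x_1 - x_2) + ReLU(x_2 - x_1).
abs_diff = [
    (np.array([[1.0, -1.0], [-1.0, 1.0]]), np.zeros(2)),
    (np.array([[1.0, 1.0]]), np.zeros(1)),
]

if __name__ == "__main__":
    print(evaluate(abs_diff, [0.2, 0.7]))
```

In this representation the depth $L$ is `len(layers) - 1` and the width $W$ is the number of rows of the hidden-layer matrices.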

We collect some properties of the sets $\Upsilon^{W,L}$ , which are needed in the sequel.

Proposition 4.2.

Let $W\geq 2$ .

(i)

Let $\mathcal{Y}_{1}\in\Upsilon^{W,L_{1}},\ldots,\mathcal{Y}_{k}\in\Upsilon^{W,L_{k}}$ . Then the composition of the $\mathcal{Y}_{i}$ satisfies

[TABLE]

(ii)

Let $L\geq 1$ . Then $\Upsilon^{W,L}\subset\Upsilon^{W,L+1}$ .

Proof.

The proof of (i) can be found in [7, Prop. 4.2] for $d=1$ . The proof for general $d\geq 1$ follows virtually without any change.

For the proof of (ii), we consider the identity function ${\rm id}(x)=x$ for $x\in\mathbb{R}$ . We rewrite it as

$$\operatorname{id}(x)=x=\operatorname{ReLU}(x)-\operatorname{ReLU}(-x),\qquad x\in\mathbb{R},\tag{33}$$

to conclude that ${\rm id}\in\Upsilon^{2,1}$ . An easy modification of (33) also shows that ${\rm id}\in\Upsilon^{W,1}$ for every $W\geq 2$ . The result then follows by (i). ∎
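Both parts of Proposition 4.2 can be illustrated with scalar networks of width two. The sketch merges the last affine map of the first network with the first affine map of the second one, so that the depths add up, and it pads a network by composing it with the identity written as $\operatorname{ReLU}(x)-\operatorname{ReLU}(-x)$. The list-of-pairs storage format and the names `evaluate`, `compose`, `identity` and `hat` are assumptions of this example. The hat network uses the representation $2\operatorname{ReLU}(x)-4\operatorname{ReLU}(x-1/2)$, which agrees with the hat function on $[0,1]$.

```python
import numpy as np


def evaluate(layers, x):
    z = np.asarray(x, dtype=float)
    for M, b in layers[:-1]:
        z = np.maximum(M @ z + b, 0.0)
    M, b = layers[-1]
    return M @ z + b


def compose(first, second):
    """Network for second(first(x)); the output of `first` must fit the input of `second`.

    The last affine map of `first` and the first affine map of `second`
    are merged into a single affine map, so the depths add up.
    """
    (M1, b1), (M2, b2) = first[-1], second[0]
    merged = (M2 @ M1, M2 @ b1 + b2)
    return first[:-1] + [merged] + second[1:]


# Identity on R with width 2 and depth 1: x = ReLU(x) - ReLU(-x).
identity = [(np.array([[1.0], [-1.0]]), np.zeros(2)),
            (np.array([[1.0, -1.0]]), np.zeros(1))]

# Hat function on [0, 1] with width 2 and depth 1.
hat = [(np.array([[1.0], [1.0]]), np.array([0.0, -0.5])),
       (np.array([[2.0, -4.0]]), np.zeros(1))]

if __name__ == "__main__":
    hat2 = compose(hat, hat)          # depth 2, computes the twofold composition
    padded = compose(hat, identity)   # depth 2, still computes the hat function
    print(evaluate(hat2, [0.3]), evaluate(padded, [0.3]))
```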

An important example and building block for our constructions to follow is the hat function $H:[0,1]\rightarrow\mathbb{R}$

$$H(x)=\begin{cases}2x,&x\in[0,1/2],\\ 2(1-x),&x\in(1/2,1],\end{cases}$$

which was already used in connection with feed-forward neural networks with $\operatorname{ReLU}$ activation function by [22], cf. also [7, 20, 25]. From the representation

$$H(x)=2\operatorname{ReLU}(x)-4\operatorname{ReLU}\bigl(x-\tfrac{1}{2}\bigr),\qquad x\in[0,1],$$

we see that $H$ belongs to $\Upsilon^{2,1}$ , see Figure 6.

Furthermore, since $H\in\Upsilon^{2,1}$ , we deduce from Proposition 4.2(i) that the $k$ -fold composition $H^{\circ k}:=H\circ H\circ\dots\circ H$ belongs to $\Upsilon^{2,k}$ , cf. Figure 7.
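The behaviour of the iterated hat function is easy to inspect numerically. The sketch writes $H$ with two ReLU units, which agrees with the hat function on $[0,1]$, and evaluates the $m$-fold composition at the dyadic points $l2^{-m}$. The names `H` and `H_iter` are hypothetical.

```python
import numpy as np


def H(x):
    """Hat function via two ReLU units; agrees with the hat on [0, 1]."""
    return 2.0 * np.maximum(x, 0.0) - 4.0 * np.maximum(x - 0.5, 0.0)


def H_iter(x, k):
    """k-fold composition of H."""
    for _ in range(k):
        x = H(x)
    return x


if __name__ == "__main__":
    m = 3
    nodes = np.arange(2 ** m + 1) / 2 ** m
    print(H_iter(nodes, m))   # alternates between 0 and 1 at l / 2**m
```

Each composition doubles the number of teeth while adding only one layer of width two, which is why high-frequency piecewise linear functions come cheaply in depth.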

In particular, $H^{\circ m}$ is a sawtooth function taking alternately the values $0$ and $1$ at its breakpoints $l2^{-m}$ , $l=0,1,\ldots,2^{m}$ , cf. [10, Lemma III.1] or [22, Lemma 2.4]. Moreover, since the restriction of the function $(2^{m}x-\lfloor 2^{m}x\rfloor)$ to each interval $[l2^{-m},(l+1)2^{-m})$ is a linear function with slope $2^{m}$ that vanishes at $l2^{-m}$ , and since $\mathcal{C}(0)=\mathcal{C}(1)$ , we have the coincidence

[TABLE]

cf. [7, Page 147]. Therefore, since $H$ and $\mathcal{C}=1-2H$ belong to $\Upsilon^{2,1}$ , we obtain from Proposition 4.2(i) that $\mathcal{C}_{2^{m}}\in\Upsilon^{2,m+1}$ for $m\in\mathbb{N}_{0}$ . Concerning $\mathcal{S}$ , we deduce $\mathcal{S}\in\Upsilon^{2,2}$ from the identity $\mathcal{S}(x)=\mathcal{C}_{2}(\frac{x}{2}+\frac{3}{8})$ , $x\in[0,1]$ , which ultimately yields $\mathcal{S}_{2^{m}}\in\Upsilon^{2,m+2}$ . Moreover, for arbitrary $j\in\mathbb{N}$ , we choose the smallest $m\in\mathbb{N}$ such that $j\leq 2^{m}$ and, in view of $\mathcal{C}_{j}(x)=\mathcal{C}_{2^{m}}(j2^{-m}x)$ , $j\leq 2^{m}$ , see that the following holds.

Lemma 4.3.

Let $j\in\mathbb{N}$ . Then, restricted to $[0,1]$ ,

$$\mathcal{C}_{j},\ \mathcal{S}_{j}\in\Upsilon^{2,\lceil\log_{2}j\rceil+2}$$

and all the entries of weight matrices and the bias vectors are bounded by 8.

This was already observed in the proof of [7, Theorem 6.2]. We now provide a multivariate version of Lemma 4.3.

Lemma 4.4.

Let $d>1$ and $\alpha\in\mathbb{Z}^{d}\setminus\{0\}$ . Then, restricted to $x\in[0,1]^{d}$ ,

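$$\mathcal{C}(\alpha\cdot x),\ \mathcal{S}(\alpha\cdot x)\in\Upsilon^{2,\lceil\log_{2}\|\alpha\|_{1}\rceil+2},$$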

where $\|\alpha\|_{1}=|\alpha_{1}|+\ldots+|\alpha_{d}|$ . Also in this case, the weights and biases are bounded by 8.

Proof.

We first extend the functions $\mathcal{C}_{2^{m}}$ , $\mathcal{S}_{2^{m}}$ from $[0,1]$ to the interval $[-1,1]$ by putting

[TABLE]

and deduce that $\tilde{\mathcal{C}}_{2^{m}}\in\Upsilon^{2,m+2}$ . Let now $x\in[0,1]^{d}$ , then from $\alpha\cdot x=\alpha_{1}x_{1}+\ldots+\alpha_{d}x_{d}$ we get

[TABLE]

Moreover, using the fact that $\mathcal{C}(\alpha\cdot x)=\tilde{\mathcal{C}}_{2^{m}}(2^{-m}\alpha\cdot x)$ , we choose $m:=\lceil\log_{2}\|\alpha\|_{1}\rceil$ and obtain

[TABLE]

which implies $\mathcal{C}(\alpha\cdot x)\in\Upsilon^{2,\lceil\log_{2}\|\alpha\|_{1}\rceil+2}$ . The result for $\mathcal{S}(\alpha\cdot x)$ follows by similar considerations. ∎
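The construction behind the statement for $\mathcal{C}(\alpha\cdot x)$ can be written out with explicit weights. The sketch below is one possible realization under stated assumptions: the first layer computes $|t\cdot x|$ with $t=2^{-m}\alpha$ through $\operatorname{ReLU}(t\cdot x)+\operatorname{ReLU}(-t\cdot x)$, the next $m$ layers apply the hat function $H(s)=2\operatorname{ReLU}(s)-4\operatorname{ReLU}(s-1/2)$, and the last layer applies $\mathcal{C}=1-2H$. The list-of-pairs storage format and the names `evaluate` and `cos_like_network` are hypothetical, and the exact weights in the paper may be arranged differently.

```python
import math

import numpy as np


def evaluate(layers, x):
    z = np.asarray(x, dtype=float)
    for M, b in layers[:-1]:
        z = np.maximum(M @ z + b, 0.0)
    M, b = layers[-1]
    return M @ z + b


def cos_like_network(alpha):
    """Width-2 ReLU network for x -> C(alpha . x) on [0, 1]^d (illustrative sketch)."""
    alpha = np.asarray(alpha, dtype=float)
    m = math.ceil(math.log2(np.abs(alpha).sum()))
    t = alpha / 2.0 ** m                    # t . x lies in [-1, 1] on the unit cube
    layers = [(np.vstack([t, -t]), np.zeros(2))]
    read = np.array([[1.0, 1.0]])           # |t.x| = ReLU(t.x) + ReLU(-t.x)
    fan = np.array([[1.0], [1.0]])          # with the bias below: s -> (s, s - 1/2)
    bias = np.array([0.0, -0.5])
    for _ in range(m + 1):                  # m hat layers plus one layer for C
        layers.append((fan @ read, bias))
        read = np.array([[2.0, -4.0]])      # H(s) = 2 ReLU(s) - 4 ReLU(s - 1/2)
    layers.append((-2.0 * read, np.array([1.0])))   # C = 1 - 2 H on [0, 1]
    return layers


if __name__ == "__main__":
    alpha = (3, -2, 1)
    net = cos_like_network(alpha)
    x = np.array([0.2, 0.9, 0.4])
    direct = 4.0 * abs(np.mod(np.dot(alpha, x), 1.0) - 0.5) - 1.0
    print(len(net) - 1, evaluate(net, x)[0], direct)
```

For $\alpha=(3,-2,1)$ we have $\|\alpha\|_{1}=6$ and $m=3$, so the network has depth $m+2=5$, and no weight or bias exceeds $8$ in modulus.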

Remark 4.5.

Let us point out that the depth $L$ of our neural network depends on the dimension $d$ only implicitly (namely, through its logarithmic dependence on the $\ell_{1}$ -norm of $\alpha$ ).

Finally, using Lemma 4.4 we obtain a multivariate analogue of [7, Theorem 6.2], where it was shown that one can reproduce linear combinations of $\mathcal{C}_{k}$ and $\mathcal{S}_{k}$ via $\mathrm{ReLU}$ networks with a good control of the depth $L$ .

Theorem 4.6.

Let $d\geq 1$ and let $k,l\geq 0$ be integers with $k+l\geq 1$ . Let $\{\alpha_{1},\dots,\alpha_{k},\beta_{1},\dots,\beta_{l}\}\subset\mathbb{Z}^{d}\setminus\{0\}$ . Then the function

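$$\sum_{i=1}^{k}a_{i}\,\mathcal{C}(\alpha_{i}\cdot x)+\sum_{j=1}^{l}b_{j}\,\mathcal{S}(\beta_{j}\cdot x),\qquad x\in[0,1]^{d},$$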

belongs to $\Upsilon^{W,L}$ with

[TABLE]

and the weights and biases in this network are bounded by $\max_{\begin{subarray}{c}i=1,\ldots,k;\\ j=1,\ldots,l\end{subarray}}\{8|a_{i}|,8|b_{j}|,8\}$ .

Proof.

By Proposition 4.2 (ii) and Lemma 4.4, $\mathcal{C}(\alpha_{i}\cdot x)\in\Upsilon^{2,L}$ and $\mathcal{S}(\beta_{j}\cdot x)\in\Upsilon^{2,L}$ for every $i=1,\dots,k$ and $j=1,\dots,l$ if we restrict $x$ to $[0,1]^{d}$ . By Definition 4.1 we have the corresponding representations for $x\in[0,1]^{d}$ and all admissible $i$ ’s and $j$ ’s

[TABLE]

If $z=(z_{1},\dots,z_{n})$ is a vector in $\mathbb{R}^{n}$ and $1\leq u<v\leq n$ are integers, then we denote by $z_{u,v}=(z_{u},z_{v})$ the restriction of $z$ to the set $\{u,v\}$ . Furthermore, we denote $x^{(0)}=(x_{1},\dots,x_{d})$ and stack the networks (34) and (35) on top of each other. In this way, we obtain a series of intermediate vectors $x^{(1)},\dots,x^{(L)}\in\mathbb{R}^{W}$

[TABLE]

Finally, the result follows by observing that

[TABLE]

see Figure 8.

Since by Lemma 4.4, all the weights and biases used in the calculation of $x^{(1)},\dots,x^{(L)}$ are bounded by 8, the weights in the last step are then bounded by $\max_{i=1,\dots,k}8|a_{i}|$ and $\max_{j=1,\dots,l}8|b_{j}|$ , respectively. ∎
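The stacking argument can be sketched in code for a sum of $\mathcal{C}$-terms. To keep the example short, every summand is built with the same number $m$ of hat layers, where $2^{m}$ bounds all $\|\alpha_{i}\|_{1}$, so that no depth padding is needed; the $\mathcal{S}$-terms are omitted. These simplifications, the list-of-pairs storage format, and the names `cos_like_network`, `block_diag` and `stack` are assumptions of this illustration.

```python
import numpy as np


def evaluate(layers, x):
    z = np.asarray(x, dtype=float)
    for M, b in layers[:-1]:
        z = np.maximum(M @ z + b, 0.0)
    M, b = layers[-1]
    return M @ z + b


def cos_like_network(alpha, m):
    """Width-2, depth-(m+2) network for C(alpha . x); needs 2**m >= ||alpha||_1."""
    t = np.asarray(alpha, dtype=float) / 2.0 ** m
    layers = [(np.vstack([t, -t]), np.zeros(2))]
    read = np.array([[1.0, 1.0]])                     # |t.x|
    for _ in range(m + 1):
        layers.append((np.array([[1.0], [1.0]]) @ read, np.array([0.0, -0.5])))
        read = np.array([[2.0, -4.0]])                # hat function
    layers.append((-2.0 * read, np.array([1.0])))     # C = 1 - 2 H
    return layers


def block_diag(blocks):
    rows = sum(B.shape[0] for B in blocks)
    cols = sum(B.shape[1] for B in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for B in blocks:
        out[r:r + B.shape[0], c:c + B.shape[1]] = B
        r += B.shape[0]
        c += B.shape[1]
    return out


def stack(networks, coeffs):
    """Run equally deep networks side by side and output sum_i coeffs[i] * N_i(x)."""
    depth = len(networks[0]) - 1
    # First affine map: every sub-network reads the same input x.
    layers = [(np.vstack([n[0][0] for n in networks]),
               np.concatenate([n[0][1] for n in networks]))]
    # Hidden affine maps: the sub-networks do not interact.
    for l in range(1, depth):
        layers.append((block_diag([n[l][0] for n in networks]),
                       np.concatenate([n[l][1] for n in networks])))
    # Last affine map: weighted sum of the individual outputs.
    M = np.hstack([c * n[depth][0] for c, n in zip(coeffs, networks)])
    b = sum(c * n[depth][1] for c, n in zip(coeffs, networks))
    layers.append((M, b))
    return layers


if __name__ == "__main__":
    alphas, coeffs, m = [(1, 2), (3, -1)], [0.5, -2.0], 2
    net = stack([cos_like_network(a, m) for a in alphas], coeffs)
    print(len(net) - 1, net[1][0].shape, evaluate(net, [0.3, 0.8]))
```

The stacked network has width two per summand, and only the last affine map sees the coefficients, which is why the weight bound scales with $8|a_{i}|$.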

Acknowledgment: We would like to thank the authors of [7] for their kind permission to re-use some of their figures. We also thank Dorothee Haroske (FSU Jena, Germany) for her hospitality during our stay in Jena, where part of the work took place.

Bibliography

[1] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen, Optimal approximation with sparsely connected deep neural networks, SIAM J. Math. Data Sci. 1 (2019), no. 1, 8–45.
[2] A. Bourouihiya, The tensor product of frames, Sampl. Theory Signal Image Process. 7 (2008), no. 1, 65–76.
[3] P. Beneventano, P. Cheridito, R. Graeber, A. Jentzen, and B. Kuckuck, Deep neural network approximation theory for high-dimensional functions, available at arXiv:2112.14523.
[4] O. Christensen, Frames and projection method, Appl. Comput. Harmon. Anal. 1 (1993), 50–53.
[5] O. Christensen, Frames containing a Riesz basis and approximation of the frame coefficients using finite dimensional methods, J. Math. Anal. Appl. 199 (1996), 256–270.
[6] O. Christensen, An introduction to frames and Riesz bases, Applied and Numerical Harmonic Analysis. Birkhäuser Boston, Inc., Boston, MA, 2003.
[7] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, and G. Petrova, Nonlinear Approximation and (Deep) ReLU Networks, Constr. Appr. 55 (2022), 127–172.
[8] R. DeVore, B. Hanin, and G. Petrova, Neural network approximation, Acta Numer. 30 (2021), 327–444.
