Empirical Analysis of the Necessary and Sufficient Conditions of the Echo State Property

Sebastián Basterrech

arXiv:1703.06664·cs.NE·March 21, 2017


TL;DR

This paper empirically investigates the conditions under which the Echo State Property holds in Echo State Networks. The results suggest that the best accuracy occurs near the border of stability, and that this border lies closer to the sufficient condition than to the necessary condition.

Contribution

It provides empirical insights into the gap between necessary and sufficient conditions for the ESP, enhancing understanding of reservoir generation for better accuracy.

Findings

01. Optimal accuracy is near the border of stability control.

02. The border is closer to sufficient conditions than necessary conditions.

03. Empirical analysis clarifies the theoretical gap in ESP conditions.


Equations

$$E(\mathbf{y}(t),\mathbf{b}(t))=\frac{\langle||\mathbf{b}(t)-\mathbf{y}(\mathbf{w},\mathbf{a}(t))||^{2}\rangle}{\langle||\mathbf{b}(t)-\langle\mathbf{b}(t)\rangle||^{2}\rangle},\tag{1}$$

$$\mathbf{s}(t+1)=\psi(\mathbf{w}^{\rm{in}}\mathbf{a}(t+1)+\mathbf{w}^{\rm{r}}\mathbf{s}(t)),\tag{2}$$

$$\mathbf{y}(t)=\nu(\mathbf{w}^{\rm{out}},\mathbf{s}(t))=\mathbf{w}^{\rm{out}}\mathbf{s}(t).\tag{3}$$

$$\mathbf{w}^{\rm{out}}=\mathbf{B}\mathbf{S}^{T}(\mathbf{S}\mathbf{S}^{T}+\gamma^{2}\mathbf{I})^{-1},\tag{4}$$

$${\rm MMDS}=\frac{1}{|\Delta t|}\sum_{i,j\in\Delta t,\ i\neq j}\frac{(L(i,j)-D(i,j))^{2}}{D(i,j)},\tag{5}$$

$$\frac{\partial u(t)}{\partial t}=\frac{0.2\,u(t-\tau)}{1+u(t-\tau)^{10}}-0.1\,u(t),\tag{6}$$

$$a(t)=\sin(0.2t)+\sin(0.311t)+z,\quad t=1,2,\ldots,\tag{7}$$

$$\frac{\partial x}{\partial t}=\sigma(y-x),\quad\frac{\partial y}{\partial t}=rx-y-xz,\quad\frac{\partial z}{\partial t}=xy-bz,\tag{8}$$

$$\frac{\partial x}{\partial t}=-z-y,\quad\frac{\partial y}{\partial t}=x+ry,\quad\frac{\partial z}{\partial t}=b+z(x-c),\tag{9}$$

$$x(t+1)=1-r\,x^{2}(t)+y(t),\quad y(t+1)=b\,x(t).\tag{10}$$

$$x(t+1)=1-r\,x^{2}(t)+b\,x(t-1).\tag{11}$$


Taxonomy

Topics: Neural Networks and Reservoir Computing · Advanced Memory and Neural Computing · Optical Network Technologies

Full text


Department of Computer Science, Faculty of Electrical Engineering

Czech Technical University, Praha, Czech Republic

[email protected]

This is a version of an accepted paper that will appear in the proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN) 2017.

Abstract

The Echo State Network (ESN) is a specific type of recurrent network that has gained popularity in recent years. The model contains a recurrent network, named the reservoir, that is fixed during the learning process. The reservoir is used to transform the input space into a larger space. A fundamental property that impacts the model accuracy is the Echo State Property (ESP). There are two main theoretical results related to the ESP. First, a sufficient condition for the ESP involves the singular values of the reservoir matrix. Second, a necessary condition states that the ESP can be violated depending on the spectral radius of the reservoir matrix. There is a theoretical gap between these necessary and sufficient conditions. This article presents an empirical analysis of the accuracy and the projections of reservoirs that fall in this theoretical gap. It gives some insights about the generation of the reservoir matrix. From previous works, it is already known that the optimal accuracy is obtained near the border of stability of the dynamics. According to our empirical results, this border seems to be closer to the sufficient condition than to the necessary condition of the ESP.

1 Introduction

Recurrent Neural Networks (RNNs) are fascinating tools for modelling time-series. At the beginning of the 2000s, the Echo State Network (ESN) [7] and the Liquid State Machine (LSM) [17] were introduced to the Neural Network community. They are RNNs with a specific topology and training procedure. Both models were developed independently [10]. In recent years, many variations of the original ESN and LSM have been introduced. For around 10 years, all those methods have been known as Reservoir Computing (RC) models. Nowadays, RC models are very popular due to several characteristics, which can be summarised as: robustness, fast computation, ease of understanding, ease of programming, and good accuracy. They have achieved good performance on well-known benchmark problems. In particular, they have been successfully applied to temporal learning problems [15].

A fundamental property of the ESN concerning network stability is the Echo State Property (ESP). The ESP guarantees a good design of the ESN topology. In other words, the model is in a suitable state to make good predictions. The spectral radius and the largest singular value of the matrix of recurrent connections (named the reservoir matrix) are important parameters of the model. Both parameters impact the ESP. Under some algebraic conditions the ESP is guaranteed (these conditions are associated with the singular values of the reservoir matrix). On the other hand, under other algebraic conditions the ESP can be violated (these conditions are associated with the spectral radius of the reservoir matrix). We can see those conditions as the sufficient and necessary conditions related to the ESP.

1.1 Goals and Motivations

The goal of this article is to analyze the ESN accuracy and the reservoir projections for one specific subset of reservoir matrices. We focus on the experimental analysis of an ESN model whose reservoir is defined in such a way that the theory neither confirms nor denies the ESP. The main motivations for studying those reservoirs are the following:

• Unfortunately, there is a theoretical gap between the necessary and sufficient conditions. There are reservoir matrices for which we cannot affirm whether the ESP holds or not [25].

• This gap is big [27]: we can neither confirm nor deny the ESP on a large and rich family of reservoir matrices.

• The literature suggests that reservoir units reach their optimal computational performance in a regime that lies between stable and chaotic behaviour [24, 13, 12].

• It seems that the ESN operates optimally in a stable situation when the projections are close to the border of instability [15, 24].

• RC models are widely used for solving supervised learning problems. The initialization of the reservoir impacts the model accuracy. As a consequence, an algorithm for generating optimal reservoirs is extremely valuable to the community. Here we analyze the best way of scaling randomly initialized reservoirs.

1.2 Temporal Supervised Learning

Given a dataset $\mathcal{L}$ with $T$ pairs of inputs $\mathbf{a}(t)\in\mathcal{A}$ of dimension $N_{\rm{a}}$ and desired outputs $\mathbf{b}(t)\in\mathcal{B}$ of dimension $N_{\rm{b}}$ , the goal is to find a model $\psi(\mathbf{w},\cdot)$ such that $\psi(\mathbf{w},\mathbf{a}(t))$ approximates $\mathbf{b}(t)$ as well as possible for all $\mathbf{a}(t)$ in $\mathcal{L}$ . We denote by $\mathbf{w}$ the free parameters of the model, which are adjusted according to the dataset $\mathcal{L}$ . Let $\mathbf{y}(\mathbf{w},\mathbf{a}(t))$ be the output vector produced by the model $\psi(\mathbf{w},\cdot)$ when the input is $\mathbf{a}(t)$ . In order to assess the accuracy of the model, a cost function is defined, which is a distance between the predictions $\mathbf{y}(\mathbf{w},\mathbf{a}(t))$ and the target $\mathbf{b}(t)$ . Here we use the Normalized Root Mean Squared Error (NRMSE) [15]:

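$$E(\mathbf{y}(t),\mathbf{b}(t))=\frac{\langle||\mathbf{b}(t)-\mathbf{y}(\mathbf{w},\mathbf{a}(t))||^{2}\rangle}{\langle||\mathbf{b}(t)-\langle\mathbf{b}(t)\rangle||^{2}\rangle},\tag{1}$$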

where $||\cdot||$ denotes the Euclidean norm and $\langle\cdot\rangle$ denotes the empirical mean. To simplify the notation, we index the model output only by time and write $\mathbf{y}(t)$ instead of $\mathbf{y}(\mathbf{w},\mathbf{a}(t))$ .
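
As an illustration, the error measure can be sketched in Python with NumPy as the ratio between the mean squared prediction error and the mean squared deviation of the target from its empirical mean. The function name `normalized_error` and the array layout (one row per time step) are assumptions made for this example. The sketch follows the expression as printed, without a square root; some references take the square root of this ratio when they report an NRMSE.

```python
import numpy as np

def normalized_error(y, b):
    """Mean squared error divided by the mean squared deviation of the target.

    y, b: predictions and targets with one row per time step.
    """
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    num = np.mean(np.sum((b - y) ** 2, axis=1))               # <||b(t) - y(t)||^2>
    den = np.mean(np.sum((b - b.mean(axis=0)) ** 2, axis=1))  # <||b(t) - <b(t)>||^2>
    return num / den

b = np.array([0.0, 1.0, 2.0, 3.0])
perfect = normalized_error(b, b)                      # exact predictions
baseline = normalized_error(np.full(4, b.mean()), b)  # always predict the mean
```

With this normalization, a model that always predicts the empirical mean of the target scores 1, and a perfect model scores 0.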

There are at least two well-differentiated situations in supervised learning. In the first, the data points of $\mathcal{L}$ are independent of each other. This context is called non-temporal supervised learning. In the second, $\mathcal{L}$ contains dependent data points. This situation is named temporal learning. Even though RNNs and their variations can be used for solving non-temporal learning problems, the most common applications are in the context of temporal learning. In this case, the model has the form $\psi(\mathbf{w},\mathbf{a}(t),\mathbf{a}(t-1),\ldots)$ because each point depends on the preceding ones.

2 Reservoir Computing Models

The Reservoir Computing (RC) paradigm started around 15 years ago with the introduction of a new approach for designing the topology and the training algorithms of Recurrent Neural Networks (RNNs). There is a general consensus that the first two models presented to the community were the Echo State Network (ESN) [7] and the Liquid State Machine (LSM) [17]. Since 2007, these methods and their variations have become popular under the name Reservoir Computing models [24]. An RC method has two clearly distinguished structures. One is an RNN whose parameters (connection weights) are randomly initialized and kept fixed during the learning process. The other structure is memory-free (without recurrences) and its parameters are adjusted using traditional supervised learning approaches. The memory-free structure is often called the readout, and most often consists of a linear regression model. Figure 1 shows a general scheme of the information flow of an RC model. The reservoir structure projects the input patterns into a new, larger space. This projection has two goals: one is to enhance the linear separability of the input space, the other is to memorize the sequence of input patterns. A linear regression is applied from the projected space to the output space to generate the model outputs.

There are several types of RC models. The main differences among them are the kind of activation function in the reservoir nodes and the type of supervised learning tool in the memory-free structure. For example, the LSM arises from the interest in making a conceptual representation of the cortical microstructures in the brain, and the neurons in its reservoir are leaky integrate-and-fire (LIF) neurons [18]. An RC model named Leaky integrator ESN, introduced in [9], has gained popularity due to its good results in practice [9, 15]. In this model each neuron has a weighted memory of its previous state, so the variation of the neuron state is much smoother than in the case of classic sigmoid neurons. A reservoir with dynamical synapses and threshold logic rates has been studied in [24]. Reservoir units in the presence of noise have been studied in [21]. In addition, two models with neurons inspired by recursive Self-Organizing Maps (SOMs) have been developed in [14, 1]. Another RC model, which combines ideas from another scientific area, has been introduced in [2]; in this case the activations are based on queueing network behaviour. This list of RC models is not exhaustive. All these models have in common a specific type of projection of the input patterns into a large space. These projections have a type of memory given by the recurrences in the network, and their parameters remain fixed during training.

2.1 Mathematical Formalization of the Echo State Network Model

We follow the previous notation. Given a learning dataset $\mathcal{L}$ with inputs $\mathbf{a}(t)\in\mathcal{A}$ of dimension $N_{\rm{a}}$ , a reservoir is an RNN composed of $N_{\rm{s}}$ interconnected neurons. The connections are collected in an $N_{\rm{s}}\times N_{\rm{s}}$ matrix that we denote by $\mathbf{w}^{\rm{r}}$ . A matrix $\mathbf{w}^{\rm{in}}$ with dimensions $N_{\rm{s}}\times N_{\rm{a}}$ collects the forward weights between the inputs and the reservoir neurons. To simplify the notation, we include the bias in $\mathbf{w}^{\rm{in}}$ . We assume discrete dynamics: at each time step $t$ an input pattern $\mathbf{a}(t)$ is presented to the network, and the reservoir state is computed with the recurrent expression

$$\mathbf{s}(t+1)=\psi(\mathbf{w}^{\rm{in}}\mathbf{a}(t+1)+\mathbf{w}^{\rm{r}}\mathbf{s}(t)),\tag{2}$$

where $\psi(\mathbf{w}^{\rm{in}},\mathbf{w}^{\rm{r}},\cdot)$ is a Lipschitz function [25], most often the hyperbolic tangent. We can see the reservoir as an independent RNN from an input space $\mathcal{A}$ to $\mathcal{S}$ that expands the history of input data $(\mathbf{a}(t),\mathbf{a}(t-1),\ldots)$ into a space of dimension $N_{\rm{s}}$ . The reservoir size is selected such that $N_{\rm{a}}\ll N_{\rm{s}}$ . Once the projections from $\mathcal{A}$ to $\mathcal{S}$ are performed, a parametric function $\nu:\mathcal{S}\rightarrow\mathcal{B}$ is learnt using the training samples in $\mathcal{L}$ . In the canonical ESN the function $\nu(\mathbf{w}^{\rm{out}},\cdot)$ is a linear model, and its parameters ( $\mathbf{w}^{\rm{out}}$ ) are the forward weights between the reservoir neurons and the output neurons. We collect those weights in an $N_{\rm{b}}\times N_{\rm{s}}$ matrix. Again, to simplify the notation, we omit the explicit bias term of the linear regression. The model output is computed as

$$\mathbf{y}(t)=\nu(\mathbf{w}^{\rm{out}},\mathbf{s}(t))=\mathbf{w}^{\rm{out}}\mathbf{s}(t).\tag{3}$$
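
The state update and the linear readout can be sketched in Python with NumPy as follows. This is a minimal illustration, not the Matlab code used for the experiments: the sizes, the sine input, the zero initial state, the scaling of the reservoir to a largest singular value of 0.9, and the random (untrained) readout are all assumptions chosen only to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
N_a, N_s, N_b, T = 1, 50, 1, 200             # illustrative sizes (assumed)

w_in = rng.uniform(-0.5, 0.5, (N_s, N_a))    # input weights
w_r = rng.uniform(-0.5, 0.5, (N_s, N_s))     # reservoir weights
w_r *= 0.9 / np.linalg.norm(w_r, 2)          # largest singular value 0.9 (assumed choice)
w_out = rng.uniform(-0.5, 0.5, (N_b, N_s))   # untrained readout, for shapes only

def run_reservoir(w_in, w_r, inputs, s0=None):
    """Iterate s(t+1) = tanh(w_in a(t+1) + w_r s(t)); return the states as columns."""
    s = np.zeros(w_r.shape[0]) if s0 is None else s0
    states = []
    for a in inputs:
        s = np.tanh(w_in @ a + w_r @ s)
        states.append(s)
    return np.array(states).T                # shape (N_s, T)

inputs = np.sin(0.2 * np.arange(T)).reshape(T, N_a)
S = run_reservoir(w_in, w_r, inputs)
Y = w_out @ S                                # y(t) = w_out s(t), shape (N_b, T)

# Same inputs, different initial state: the final states nearly coincide
# because this reservoir was scaled to be contractive.
S_other = run_reservoir(w_in, w_r, inputs, s0=np.ones(N_s))
gap = np.linalg.norm(S[:, -1] - S_other[:, -1])
```

Only `w_out` would be adjusted during training; `w_in` and `w_r` stay fixed after initialization.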

A popular training algorithm for computing $\mathbf{w}^{\rm{out}}$ in the expression (3) is offline ridge regression [7]. The algorithm uses two auxiliary matrices $\mathbf{S}$ and $\mathbf{B}$ of dimensions $N_{\rm{s}}\times T$ and $N_{\rm{b}}\times T$ , respectively. These matrices collect in their columns the reservoir projections $\mathbf{s}{(t)}$ and the target data $\mathbf{b}{(t)}$ . Then, the output weight matrix $\mathbf{w}^{\rm{out}}$ is computed by

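$$\mathbf{w}^{\rm{out}}=\mathbf{B}\mathbf{S}^{T}(\mathbf{S}\mathbf{S}^{T}+\gamma^{2}\mathbf{I})^{-1},\tag{4}$$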

where $\mathbf{I}$ is the identity matrix of order $N_{\rm{s}}$ , $\gamma$ is a regularization parameter, and the matrices $\mathbf{B}\mathbf{S}^{T}$ and $\mathbf{S}\mathbf{S}^{T}$ have dimensions $N_{\rm{b}}\times N_{\rm{s}}$ and $N_{\rm{s}}\times N_{\rm{s}}$ , respectively. As a consequence, the cost of solving this system does not depend on the number of samples $T$ , neither in time nor in space [15].
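
A minimal sketch of the ridge regression readout in Python with NumPy is given below. The function name `train_readout` and the synthetic data are assumptions for illustration. The sketch solves the linear system instead of forming the inverse explicitly, which is a common numerical choice and yields the same $\mathbf{w}^{\rm{out}}$ up to rounding.

```python
import numpy as np

def train_readout(S, B, gamma):
    """w_out = B S^T (S S^T + gamma^2 I)^(-1), with S: (N_s, T) and B: (N_b, T)."""
    N_s = S.shape[0]
    A = S @ S.T + gamma ** 2 * np.eye(N_s)   # (N_s, N_s), symmetric
    # Solve w_out A = B S^T; since A is symmetric this is A w_out^T = (B S^T)^T.
    return np.linalg.solve(A, (B @ S.T).T).T

rng = np.random.default_rng(1)
S = rng.standard_normal((5, 400))            # stand-in for reservoir states
w_true = rng.standard_normal((2, 5))
B = w_true @ S                               # targets that are exactly linear in S
w_out = train_readout(S, B, gamma=1e-4)
```

Because the targets in this toy example are exactly linear in the states and the regularization is small, the recovered weights are close to `w_true`.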

2.2 Properties of the ESN projections

The ESN belongs to the family of random projection models [3]. The model is based on the fact that a random encoding of the input samples can enhance their linear separability. Even though the reservoir states are randomly initialized, the model should be independent of the initial network trajectories in the long term. Hence, the network needs some type of fading memory with respect to the initial conditions and initial dynamics. Additionally, it should satisfy a type of “functional” relationship where each input sequence has a single output sequence in the long term. These two characteristics are established in a property of the reservoir state transitions named the Echo State Property (ESP) [15]. In the following we present the ESP [7]. Assume that the network topology has no feedback connections, that the input sequences belong to an input space $\mathcal{A}$ , and that the network states lie in a compact set $\mathcal{S}$ . Then the ESN has echo states if $\mathbf{s}(t)$ is uniquely determined by any left-infinite input sequence $\{\mathbf{a}(t-k):k\in\mathbb{N}\}$ [27]. The ESP establishes that the trajectories of the reservoir states depend only on the input driving the network, and not on the initial conditions of the network. In other words, similar reservoir states must be generated for similar input sequences. If the model does not satisfy the ESP, small perturbations can bring the network to new states, which can affect the prediction abilities of the model [25].

We specify some notation: let $\rho(A)$ be the spectral radius of a matrix $A$ , and let $\eta(A)$ be the largest singular value of $A$ . The following fundamental result has been analyzed in [7, 15, 25]: if the largest singular value of the reservoir connection matrix is smaller than one, then the model satisfies the ESP for every input. In more detail, if $\eta(\mathbf{w}^{\rm{r}})<1$ , where $\eta(\mathbf{w}^{\rm{r}})$ is defined as $\sqrt{\rho(\mathbf{w}^{\rm{r}}{\mathbf{w}^{\rm{r}}}^{T})}$ and ${\mathbf{w}^{\rm{r}}}^{T}$ is the transposed reservoir matrix, then the ESP holds for every input. On the other hand, the ESP is violated when $\rho(\mathbf{w}^{\rm{r}})>1$ , under the additional condition that $\mathcal{A}$ contains the zero input sequence. As a consequence, $\rho(\mathbf{w}^{\rm{r}})\leq 1$ is used as a necessary condition for the ESP. In addition, the ESP can be lost even for $\rho(\mathbf{w}^{\rm{r}})<1$ (e.g. in the zero-input case), and vice versa, the ESP can be preserved for $\rho(\mathbf{w}^{\rm{r}})>1$ [19].

Therefore, there are two well-analysed results, a sufficient and a necessary condition for the ESP. In summary, we have:

• Sufficient condition: if $\eta(\mathbf{w}^{\rm{r}})<1$ , then the ESP is satisfied.

• Necessary condition: when the input space contains the zero input sequence, $\rho(\mathbf{w}^{\rm{r}})\leq 1$ is required for the ESP to hold.
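
The two conditions can be checked numerically for a given reservoir matrix. The following Python sketch (function names and the example matrix are assumptions for illustration) reports which of the three cases applies: the sufficient condition holds, the necessary condition is violated, or the matrix falls in the gap where neither condition decides.

```python
import numpy as np

def spectral_radius(w):
    """Largest eigenvalue modulus, rho(w)."""
    return np.max(np.abs(np.linalg.eigvals(w)))

def largest_singular_value(w):
    """Largest singular value, eta(w) (the matrix 2-norm)."""
    return np.linalg.norm(w, 2)

def esp_status(w):
    if largest_singular_value(w) < 1:
        return "sufficient condition holds"
    if spectral_radius(w) > 1:
        return "necessary condition violated"
    return "gap: neither condition decides"

# A non-normal matrix: both eigenvalues equal 0.5, but the largest
# singular value is about 2.1, so the two conditions disagree.
w_gap = np.array([[0.5, 2.0],
                  [0.0, 0.5]])

status_gap = esp_status(w_gap)
status_small = esp_status(0.25 * w_gap)   # scaled down
status_large = esp_status(3.0 * w_gap)    # scaled up
```

Since the spectral radius never exceeds the largest singular value, the first two cases cannot hold at the same time.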

A simple procedure for creating an ESN is to randomly initialize the reservoir matrix $\mathbf{w}^{\rm{r}}_{\rm{initial}}$ and then to scale it by a factor $\alpha$ as follows: $\mathbf{w}^{\rm{r}}=\alpha\mathbf{w}^{\rm{r}}_{\rm{initial}}$ . The selection of the scaling factor impacts the ESP. Since both the largest singular value and the spectral radius scale linearly with $\alpha>0$ , the sufficient condition for the ESP becomes $\alpha<\eta(\mathbf{w}^{\rm{r}}_{\rm{initial}})^{-1}$ , and the necessary condition becomes $\alpha\leq\rho(\mathbf{w}^{\rm{r}}_{\rm{initial}})^{-1}$ . In practice, the sufficient condition can be conservative. Furthermore, it can have a negative impact on the long-memory capacity of the reservoir [7, 27]; in this sense the sufficient condition can be too restrictive. On the other hand, if the necessary condition is violated ( $\rho(\mathbf{w}^{\rm{r}})>1$ ), the network has an asymptotically unstable null state, and thus the ESP is lost for any input set containing the zero-input sequence [7].

Stability has also been analyzed in [26], where the authors derive a new, less restrictive sufficient condition for the ESP. There, the ESP is studied in terms of diagonal Schur stability, based on a positive definite matrix [26]. As far as we know, there is a theoretical gap about the ESP when $\alpha$ belongs to the interval $U=[\eta(\mathbf{w}^{\rm{r}}_{\rm{initial}})^{-1},\rho(\mathbf{w}^{\rm{r}}_{\rm{initial}})^{-1}]$ . When the scaling factor $\alpha$ belongs to $U$ , the stability of the ESN is unknown. Figure 2 represents the theoretical results about the ESP. The asymptotic behaviour of this theoretical gap as a function of the characteristics of $\mathbf{w}^{\rm{r}}$ has been analyzed in [27]. Using random matrix theory, the authors proved that the size of $U$ is large: the bound of the necessary condition is about twice the bound of the sufficient condition when the reservoir is composed of a very large pool of neurons ( $N_{\rm{s}}\rightarrow\infty$ ). Since we refer many times to the interval $U$ in this article, we name it the Interval of the Theoretical Unknown Conditions (ITUC). In this article, we study the accuracy of the model for reservoirs generated with $\alpha\in U$ . In other words, we analyze the behaviour of models whose reservoirs are scaled with factors in the ITUC.
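
The interval $U$ can be computed directly from the unscaled random matrix. The sketch below, in Python with NumPy, is an illustration under stated assumptions: the function name `ituc`, the reservoir size of 200, and the regular grid of 10 scaling factors that includes both endpoints are choices made for this example.

```python
import numpy as np

def ituc(w_initial):
    """Return (1/eta, 1/rho) for the unscaled reservoir matrix."""
    eta = np.linalg.norm(w_initial, 2)                  # largest singular value
    rho = np.max(np.abs(np.linalg.eigvals(w_initial)))  # spectral radius
    return 1.0 / eta, 1.0 / rho

rng = np.random.default_rng(0)
w_initial = rng.uniform(-0.5, 0.5, (200, 200))
low, high = ituc(w_initial)
alphas = np.linspace(low, high, 10)   # regular grid over U (endpoints included: assumption)

# A reservoir scaled with an interior point of U: its largest singular value
# is above 1 while its spectral radius is below 1.
w_mid = alphas[5] * w_initial
eta_mid = np.linalg.norm(w_mid, 2)
rho_mid = np.max(np.abs(np.linalg.eigvals(w_mid)))
ratio = high / low                    # roughly 2 for large random matrices, see [27]
```

Scaling by `low` places the matrix exactly on the boundary of the sufficient condition, and scaling by `high` places it on the boundary of the necessary condition.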

3 Empirical evaluations

3.1 Methodology

We analyze the behaviour of the canonical ESN when a random reservoir is generated with a scaling factor $\alpha$ in $U$ , where $U$ is the ITUC defined in the previous section. We evaluate the accuracy of the model with the NRMSE on a group of well-known benchmark datasets. The problems are described in the next subsection. As usual, we split the sequential data into two sets, one for fitting the readout weights and another one for their validation. The error is computed applying free-run prediction (one step ahead). Then, the preceding predicted values are used as input patterns for predicting the next output. We define a grid of values for the reservoir size and the scaling factor. This grid depends on the benchmark problem, although we always consider 10 different values of $N_{\rm{s}}$ and 10 different values of the scaling factor $\alpha$ . In order to produce statistically significant results, we perform the experiments on each benchmark dataset using $30$ different random initialisations. For each specific benchmark problem, we arbitrarily define $10$ reservoir sizes $N_{\rm{s}}^{(1)},\ldots,N_{\rm{s}}^{(10)}$ . For each reservoir size $N_{\rm{s}}^{(i)}$ , we randomly initialize a reservoir matrix ${\mathbf{w}^{\rm{r}}}^{(i)}_{\rm{initial}}$ . Next, we compute the ITUC $U^{(i)}$ . For each interval $U^{(i)}$ , we compute $10$ scaling factors $\alpha^{(i,1)},\alpha^{(i,2)},\ldots,\alpha^{(i,10)}$ . Then, we evaluate the model with the scaled reservoir matrix ${\mathbf{w}^{\rm{r}}}^{(i,j)}$ , which is the original ${\mathbf{w}^{\rm{r}}}^{(i)}_{\rm{initial}}$ scaled by $\alpha^{(i,j)}$ . Note that we repeat this experiment 30 times; in each trial the interval $U^{(i)}$ is different, and therefore the scaling factors are different as well.

Figure 3 shows the different values of $\alpha$ for the Mackey-Glass dataset. The vertical axis shows the scaling factor values and the horizontal axis the experimental trials. The number of experimental trials for each benchmark problem was 3000 (number of repetitions (30) $\times$ different reservoirs (10) $\times$ different scaling factors (10)). The experiment number (experiment identification) increases with the reservoir size; that is, the first 300 experiments correspond to the smallest reservoir. The scaling factor decreases when the reservoir size increases, because larger reservoir matrices have larger spectral radii and larger singular values [27].
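
The construction of the grid of scaling factors can be sketched as a nested loop. The following Python example is a reduced illustration, not the experimental code: it uses three reservoir sizes and three repetitions (the article uses 10 sizes and 30 repetitions), and the helper `ituc` and the dictionary `grid` are names introduced here. It only generates the scaling factors; training and evaluating the ESN at each grid point is left out.

```python
import numpy as np

def ituc(w_initial):
    """Return (1/eta, 1/rho) for the unscaled reservoir matrix."""
    eta = np.linalg.norm(w_initial, 2)                  # largest singular value
    rho = np.max(np.abs(np.linalg.eigvals(w_initial)))  # spectral radius
    return 1.0 / eta, 1.0 / rho

rng = np.random.default_rng(0)
sizes = [20, 100, 400]             # illustrative subset of reservoir sizes (assumed)
n_repetitions, n_alphas = 3, 10    # the article uses 30 repetitions

grid = {}                          # (repetition, size) -> array of 10 scaling factors
for rep in range(n_repetitions):
    for n_s in sizes:
        w_initial = rng.uniform(-0.5, 0.5, (n_s, n_s))
        low, high = ituc(w_initial)
        grid[(rep, n_s)] = np.linspace(low, high, n_alphas)

n_trials_article = 30 * 10 * 10    # repetitions x reservoir sizes x scaling factors
```

In this sketch both ends of the interval shrink as the reservoir grows, which matches the decreasing trend of the scaling factors described for Figure 3.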

The input and reservoir weights are randomly initialised in the range $[-0.5,0.5]$ . In this article we use fully connected reservoirs. Most often in the literature, a reservoir is built as a sparse pool of interconnected neurons (around $20\%$ of non-zero values). However, there is empirical evidence that the density of the reservoir matrix is much less relevant to the model accuracy than the reservoir size and the spectral radius [16]. In general, sparse reservoirs are used only for computational reasons, because models with sparse matrices are faster than models with dense ones. All the simulations were done in Matlab.

For each input pattern $\mathbf{a}\in\mathcal{A}$ , the reservoir creates a high-dimensional vector $\mathbf{s}\in\mathcal{S}$ . The dimension of $\mathcal{S}$ is much larger than the dimension of $\mathcal{A}$ . There are several techniques for dimensionality reduction and visualization of high-dimensional datasets, for example Metric Multidimensional Scaling [23], PCA, Self-Organizing Maps [11], Sammon projections, and Scale Invariant Maps [4]. In order to analyze how different values of $\alpha\in U$ generate different reservoir projections, we define a multidimensional metric inspired by the dimensionality reduction techniques mentioned above. The metric is a slight modification of multidimensional scaling (MDS). Let $L(i,j)$ be the distance between two patterns $\mathbf{a}(i)$ and $\mathbf{a}(j)$ in the input space $\mathcal{A}$ . Let $D(i,j)$ be the distance between the corresponding vectors in the projected space, that is, between $\mathbf{s}(i)$ and $\mathbf{s}(j)$ (the reservoir states generated by the network when the inputs are $\mathbf{a}(i)$ and $\mathbf{a}(j)$ ). In all cases we use the Euclidean distance. Then, we define the mean of the multidimensional scaling distance (MMDS) as follows:

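$${\rm MMDS}=\frac{1}{|\Delta t|}\sum_{i,j\in\Delta t,\ i\neq j}\frac{(L(i,j)-D(i,j))^{2}}{D(i,j)},\tag{5}$$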

where $\Delta t$ is some arbitrary time range and $|\Delta t|$ denotes the number of input patterns considered in this range. Note that the form of the MMDS is also similar to the Sammon error [4]. The goal of defining this measure is to get a notion of the topographic characteristics of the projections. Small MMDS values are produced when $L(i,j)$ is close to $D(i,j)$ . On the other hand, large MMDS values are produced when close input patterns are projected far from each other.
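
A direct way to evaluate the MMDS is to build the two pairwise Euclidean distance matrices and sum the normalized squared differences over all pairs with $i\neq j$ . The Python sketch below is one possible reading of the definition: the function name `mmds`, the row-wise data layout, and the choice of summing over ordered pairs (each unordered pair counted twice) are assumptions.

```python
import numpy as np

def mmds(inputs, states):
    """inputs: (n, N_a) patterns a(i); states: (n, N_s) projections s(i).

    Assumes all projected points are distinct, so D(i, j) > 0 for i != j.
    """
    n = len(inputs)
    L = np.linalg.norm(inputs[:, None, :] - inputs[None, :, :], axis=-1)
    D = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    off_diag = ~np.eye(n, dtype=bool)     # all ordered pairs with i != j
    return np.sum((L[off_diag] - D[off_diag]) ** 2 / D[off_diag]) / n

a = np.array([[0.0], [1.0], [3.0]])
same = mmds(a, a)             # distances preserved exactly
stretched = mmds(a, 2.0 * a)  # every distance doubled by the projection
```

A projection that preserves all pairwise distances gives a value of zero, and the value grows as the projection distorts the distances.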

3.2 Benchmark Problems Description

We analyze the reservoir projections using the following well-known simulated datasets:

3.2.1 Mackey-Glass time-series

A classic benchmark problem that has been analyzed in several papers in the RC area [7, 5, 8]. The dynamics are given by:

$$\frac{\partial u(t)}{\partial t}=\frac{0.2\,u(t-\tau)}{1+u(t-\tau)^{10}}-0.1\,u(t),\tag{6}$$

A common value for the parameter $\tau$ is $17$ , because the system has a chaotic attractor when $\tau>16.8$ [5].
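
A Mackey-Glass series can be simulated with a simple Euler discretisation of the delay differential equation $\dot{u}(t)=0.2\,u(t-\tau)/(1+u(t-\tau)^{10})-0.1\,u(t)$ . The sketch below is only an illustration: the article does not state the integration scheme, the step size, or the initial history, so the unit step, the constant history `u0 = 1.2`, and the function name `mackey_glass` are assumptions.

```python
import numpy as np

def mackey_glass(n_samples, tau=17, dt=1.0, u0=1.2):
    """Euler discretisation of the delay equation with a constant history u0 (assumed)."""
    delay = int(round(tau / dt))
    u = np.full(n_samples + delay, u0)
    for t in range(delay, n_samples + delay - 1):
        u_tau = u[t - delay]                               # delayed value u(t - tau)
        u[t + 1] = u[t] + dt * (0.2 * u_tau / (1.0 + u_tau ** 10) - 0.1 * u[t])
    return u[delay:]

series = mackey_glass(1000)
```

A finer step or a higher-order integrator would give a more accurate trajectory; the unit-step scheme is used here only because it is short.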

3.2.2 Noisy Multiple Superimposed Oscillator (MSO) time-series

The noisy MSO is a sequential dataset generated from two sine waves and Gaussian noise. The series is [22]:

$$a(t)=\sin(0.2t)+\sin(0.311t)+z,\quad t=1,2,\ldots,\tag{7}$$

where $z$ is a Gaussian random variable with distribution $\mathcal{N}(0,0.01)$ . We simulate 10000 samples for training the model, and we present the performance of the trained model on 1000 unseen simulated samples.
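
The noisy MSO series, the sum of $\sin(0.2t)$ , $\sin(0.311t)$ and Gaussian noise, can be generated in a few lines. In the Python sketch below, the function name `noisy_mso` and the seed are assumptions, and $0.01$ is interpreted as the variance of the noise (standard deviation $0.1$ ); if it denotes the standard deviation instead, the `np.sqrt` call should be dropped.

```python
import numpy as np

def noisy_mso(n_samples, noise_var=0.01, seed=0):
    """a(t) = sin(0.2 t) + sin(0.311 t) + z for t = 1, ..., n_samples."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, n_samples + 1)
    z = rng.normal(0.0, np.sqrt(noise_var), n_samples)  # assumes 0.01 is the variance
    return np.sin(0.2 * t) + np.sin(0.311 * t) + z

series = noisy_mso(11000)
train, test = series[:10000], series[10000:]   # 10000 training and 1000 unseen samples
```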

3.2.3 Lorenz attractor

The series is based on the Lorenz equations:

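$$\frac{\partial x}{\partial t}=\sigma(y-x),\quad\frac{\partial y}{\partial t}=rx-y-xz,\quad\frac{\partial z}{\partial t}=xy-bz,\tag{8}$$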

We used the parameters $r=28$ , $b=8/3$ and $\sigma=10$ , and step size $0.01$ . For more information about the integration of ordinary differential equations, see the Runge-Kutta method [20]. The training set has 13107 samples and the testing set contains 3277 samples. Once the dynamics are simulated, we normalize the data to the range [0,1].
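
The simulation of the Lorenz system can be sketched with a classical fourth-order Runge-Kutta scheme, using $\sigma=10$ , $r=28$ , $b=8/3$ and step size $0.01$ . The sketch below is an illustration in Python with NumPy: the initial state $(1,1,1)$ , the per-coordinate min-max normalization, and the function names are assumptions, since the article does not specify them.

```python
import numpy as np

def lorenz_rhs(state, sigma=10.0, r=28.0, b=8.0 / 3.0):
    """Right-hand side: dx/dt = sigma(y - x), dy/dt = rx - y - xz, dz/dt = xy - bz."""
    x, y, z = state
    return np.array([sigma * (y - x), r * x - y - x * z, x * y - b * z])

def rk4_trajectory(rhs, state0, h, n_steps):
    """Classical fourth-order Runge-Kutta integration with fixed step h."""
    traj = np.empty((n_steps + 1, len(state0)))
    traj[0] = state0
    for k in range(n_steps):
        s = traj[k]
        k1 = rhs(s)
        k2 = rhs(s + 0.5 * h * k1)
        k3 = rhs(s + 0.5 * h * k2)
        k4 = rhs(s + h * k3)
        traj[k + 1] = s + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return traj

# 13107 + 3277 = 16384 samples in total; initial state is an assumption.
traj = rk4_trajectory(lorenz_rhs, np.array([1.0, 1.0, 1.0]), h=0.01, n_steps=16383)
lo, hi = traj.min(axis=0), traj.max(axis=0)
normalized = (traj - lo) / (hi - lo)           # each coordinate mapped to [0, 1]
train, test = normalized[:13107], normalized[13107:]
```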

3.2.4 Rossler attractor

A classic time-series generated by the dynamics:

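$$\frac{\partial x}{\partial t}=-z-y,\quad\frac{\partial y}{\partial t}=x+ry,\quad\frac{\partial z}{\partial t}=b+z(x-c),\tag{9}$$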

where the parameter values are $r=0.15$ , $b=0.20$ , $c=10.0$ .

3.2.5 Henon map

The Henon map is a well-known invertible mapping of a two-dimensional plane into itself [6]. The sequence is generated by:

$$x(t+1)=1-r\,x^{2}(t)+y(t),\quad y(t+1)=b\,x(t).\tag{10}$$

where $r=1.4$ , $b=0.3$ and the initial states are $x=1$ and $y=1$ . Equivalently, the sequence can be expressed as a 2-step recurrence:

$$x(t+1)=1-r\,x^{2}(t)+b\,x(t-1).\tag{11}$$

This sequence has been analysed with ESN in at least the following works [1, 21].
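
The equivalence between the two-dimensional map and the 2-step recurrence can be checked numerically. In the plain Python sketch below (function names are assumptions), the 2-step form is seeded with the first two values $x(0)$ and $x(1)$ produced by the two-dimensional form, because the substitution $y(t)=b\,x(t-1)$ is only available from $t=1$ onwards.

```python
def henon_two_dim(n, r=1.4, b=0.3, x0=1.0, y0=1.0):
    """x(t+1) = 1 - r x(t)^2 + y(t), y(t+1) = b x(t); returns x(0), ..., x(n-1)."""
    xs, x, y = [x0], x0, y0
    for _ in range(n - 1):
        x, y = 1.0 - r * x * x + y, b * x
        xs.append(x)
    return xs

def henon_two_step(n, x_prev, x_curr, r=1.4, b=0.3):
    """x(t+1) = 1 - r x(t)^2 + b x(t-1), seeded with two consecutive values."""
    xs = [x_prev, x_curr]
    while len(xs) < n:
        xs.append(1.0 - r * xs[-1] * xs[-1] + b * xs[-2])
    return xs[:n]

seq_2d = henon_two_dim(20)
seq_2step = henon_two_step(20, seq_2d[0], seq_2d[1])
max_diff = max(abs(p - q) for p, q in zip(seq_2d, seq_2step))
```

Since the map is chaotic, small rounding differences would grow quickly over long runs, so the comparison is kept to a short prefix of the sequence.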

3.3 Empirical Results

For the first benchmark problem we used a ridge regression regularization factor of $0.0001$ and the reservoir sizes $20$ , $50$ , $75$ , $100$ , $150$ , $200$ , $250$ , $500$ , $750$ and $1000$ . For the rest of the problems the regularization factor was $0.001$ and the reservoir sizes were in $\{20,50,75,100,150,200,250,300,400,500\}$ . Figure 4 shows several plots obtained with the Mackey-Glass time-series. Each one corresponds to a specific reservoir size, which is given at the top of each plot. The horizontal axis of each subplot corresponds to the scaling factor $\alpha$ , and the vertical axis corresponds to the NRMSE. Note that $U$ is different for each reservoir size. We can see that for the “small” reservoirs ( $N_{\rm{s}}<150$ ), the accuracy is better when $\alpha$ is closer to the lower bound of $U$ . On the other hand, for very large reservoirs the relationship between the accuracy and the scaling factor is not clear.

For each benchmark problem we present two types of figures. One presents the accuracy (NRMSE) with respect to the reservoir size and the $U$ interval. The other one presents the MMDS according to the reservoir size and the $U$ interval. As mentioned above, the $U$ interval depends on the reservoir size and the random initialization of the reservoir. Therefore, these plots have been built as follows: for a specific reservoir size, we compute the $U$ interval and a regular grid with 10 values over it. Then, we average the accuracy obtained over the 30 experiments. Figures 4 and 5 show that the scaling factor and the accuracy are sensitive to the reservoir size. Extremely large reservoirs can be more unstable. Figure 6 shows (for the MSO dataset) that very large reservoirs and $\alpha$ values close to the upper bound of $U$ can cause unstable model accuracy. When the reservoir is small, the behaviour of the reservoir projection seems to be independent of the value of $\alpha\in U$ . Figure 7 shows the results for the Lorenz attractor benchmark problem. Again, small reservoirs are more stable, and the value of $\alpha$ does not seem to affect the accuracy when $N_{\rm{s}}$ is less than $200$ . On the other hand, for the Lorenz attractor dataset the experiments with smaller $\alpha$ values and large reservoirs have the worst accuracy. Figures 8 and 9 show the accuracy obtained with the Rossler attractor and Henon map datasets. Both figures have the same characteristics: the value of $\alpha$ seems to be less important for the accuracy than the reservoir size.

Another group of figures analyzes how the scaling factor impacts the topographic characteristics of the reservoir projections. In general, larger reservoirs produce larger MMDS values. However, the relationship between the MMDS values and the $\alpha$ values depends on the benchmark data. Figures 11 and 12 show that the MMDS is almost constant along the $U$ interval. In these figures the MMDS increases with the reservoir size. According to Figures 10 and 13, the value of $\alpha$ does seem to affect the MMDS measure. The impact seems to be less relevant than that of the reservoir size, but larger values of $\alpha$ may still cause larger values of MMDS. A different behaviour occurs with the Henon map dataset: in Figure 14 we can see that both the scaling factor and the reservoir size are relevant parameters. As a final remark, note that in almost all the benchmark problems the best accuracy occurs when the values of $\alpha$ are near the lower bound of $U$ . Moreover, in many cases the accuracy is stable for the different values of $\alpha$ in $U$ .

4 Conclusions

A fundamental property of the Echo State Network (ESN) model is the Echo State Property (ESP), which impacts the model predictions. A sufficient condition for the ESP involves the singular values of the reservoir matrix. On the other hand, a necessary condition for the ESP has also been introduced: the ESP can be violated depending on the spectral radius of the reservoir matrix. There is a theoretical gap between the necessary and sufficient conditions for the ESP. We describe this gap with an interval named the Interval of the Theoretical Unknown Conditions (ITUC), which is defined as a function of the spectral radius and the largest singular value of the reservoir matrix. There is a large group of reservoirs for which we can neither affirm that the ESP is satisfied nor that it is violated. This article presents an empirical analysis of the accuracy and the projections of reservoirs that belong to this group. According to our experimental results, in some benchmark problems the best accuracies occur when the reservoirs are close to satisfying the sufficient condition for the ESP. However, for small reservoirs with different spectral radii and singular values the accuracy obtained is stable. From previous works, it is known that the optimal accuracy is obtained near the border of stability of the dynamics. According to our results, it seems that this border is closer to the sufficient condition than to the necessary condition. In addition, we studied the reservoir projections using a type of multidimensional scaling metric. We found different behaviours depending on the benchmark problem.

In the near future, it would be interesting to analyze the ITUC using other metrics on the reservoir projections, for example the Lyapunov exponent of reservoir projections created with scaling factors in the ITUC. In addition, the memory capacity when the scaling factor belongs to the ITUC may also be of interest to the community.

Bibliography

[1] Sebastián Basterrech, Colin Fyfe, and Gerardo Rubino. Self-organizing Maps and Scale-invariant Maps in Echo State Networks. In 11th International Conference on Intelligent Systems Design and Applications, ISDA 2011, Córdoba, Spain, November 22-24, 2011, pages 94–99, November 2011.
[2] Sebastián Basterrech and Gerardo Rubino. Echo State Queueing Network: A new Reservoir Computing Learning Tool. In 10th IEEE Consumer Communications and Networking Conference, CCNC 2013, Las Vegas, NV, USA, January 11-14, 2013, pages 118–123, 2013.
[3] J. B. Butcher, D. Verstraeten, B. Schrauwen, C. R. Day, and P. W. Haycock. Reservoir Computing and Extreme Learning Machines for Non-linear Time-series Data Analysis. Neural Networks, 38:76–89, February 2013.
[4] Colin Fyfe. Hebbian Learning and Negative Feedback Networks, volume XVIII of Advanced Information and Knowledge Processing. Springer-Verlag London, first edition, 2005.
[5] Claudio Gallicchio and Alessio Micheli. Architectural and Markovian factors of echo state networks. Neural Networks, 24(5):440–456, 2011.
[6] M. Hénon. A two dimensional mapping with a strange attractor. Commun. Math. Phys., 50:69–77, 1976.
[7] Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks. Technical Report 148, German National Research Center for Information Technology, 2001.
[8] Herbert Jaeger and Harald Haas. Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication. Science, 304(5667):78–80, 2004.