Network Structure Effects in Reservoir Computers

Thomas L. Carroll; Louis M. Pecora

arXiv:1903.12487·cs.ET·August 30, 2019

TL;DR

This paper investigates how altering the network structure of reservoir computers by flipping edges affects their properties, such as covariance rank, and explores the implications for their computational performance.

Contribution

It introduces a simple, analyzable network model for reservoir computers and examines how structural modifications influence key mathematical properties.

Findings

1. Changing the number of flipped edges alters the covariance matrix rank.
2. Network symmetries can be characterized and related to structural modifications.
3. Covariance rank may be linked to reservoir computer performance.

Equations

$$\frac{dr_{i}(t)}{dt}=\lambda\left[p_{1}r_{i}(t)+p_{2}r_{i}^{2}(t)+p_{3}r_{i}^{3}(t)+\sum_{j=1}^{M}A_{ij}r_{j}(t)+w_{i}s(t)\right] \tag{1}$$

$$r_{i}(n+1)=\alpha r_{i}(n)+(1-\alpha)\tanh\left(\sum_{j=1}^{M}A_{ij}r_{j}(n)+w_{i}s(t)+1\right) \tag{2}$$

$$\Omega=\begin{bmatrix}r_{1}(1)&\ldots&r_{M}(1)&1\\ r_{1}(2)&\ldots&r_{M}(2)&1\\ \vdots&&\vdots&\vdots\\ r_{1}(N)&\ldots&r_{M}(N)&1\end{bmatrix} \tag{3}$$

$$h(t)=\sum_{j=1}^{M}c_{j}r_{j}(t) \tag{4}$$

$$h(t)=\Omega{\bf C} \tag{5}$$

$$\Omega=USV^{T} \tag{6}$$

$$\Omega_{inv}=VS'U^{T} \tag{7}$$

$${\bf C}=\Omega_{inv}\,g(t) \tag{8}$$

$$\Delta_{RC}=\frac{\left\|\Omega{\bf C}-g(t)\right\|}{\left\|g(t)\right\|} \tag{9}$$

$$\Delta_{tx}=\frac{\left\|\Omega'{\bf C}-z'\right\|}{\left\|z'\right\|} \tag{10}$$

$$\frac{d{\bf r}}{dt}={\bf F}({\bf r})+A{\bf H}({\bf r})+{\bf W}s(t) \tag{11}$$

$$\frac{dP{\bf r}}{dt}={\bf F}(P{\bf r})+A{\bf H}(P{\bf r})+{\bf W}s(t) \tag{12}$$

$$\Gamma={\rm rank}\left(\Omega^{T}\Omega\right) \tag{13}$$

$$MC_{k}=\frac{\sum_{n=1}^{N}\left[s(n-k)-\bar{s}\right]\left[g_{k}(n)-\bar{g}_{k}\right]}{\sum_{n=1}^{N}\left[s(n-k)-\bar{s}\right]\sum_{n=1}^{N}\left[g_{k}(n)-\bar{g}_{k}\right]} \tag{14}$$

$$MC=\sum_{k=1}^{\infty}MC_{k} \tag{15}$$

$$\begin{array}{l}\frac{dx}{dt}=c_{1}y-c_{1}x\\ \frac{dy}{dt}=x\left(c_{2}-z\right)-y\\ \frac{dz}{dt}=xy-c_{3}z\end{array} \tag{16}$$

$$\begin{array}{l}x(k)={\rm random}\left[0,0.5\right]\\ y(k+1)=0.3y(k)+0.05y^{2}(k)+1.5x^{2}(k)+0.1\end{array} \tag{17}$$


Full text

Network Structure Effects in Reservoir Computers

T. L. Carroll

US Naval Research Lab, Washington, DC 20375

L. M. Pecora

US Naval Research Lab, Washington, DC 20375

Abstract

A reservoir computer is a complex nonlinear dynamical system that has been shown to be useful for solving certain problems, such as prediction of chaotic signals, speech recognition or control of robotic systems. Typically a reservoir computer is constructed by connecting a large number of nonlinear nodes in a network, driving the nodes with an input signal and using the node outputs to fit a training signal. In this work, we set up reservoirs where the edges (or connections) between all the network nodes are either +1 or 0, and proceed to alter the network structure by flipping some of these edges from +1 to -1. We use this simple network because it turns out to be easy to characterize; we may use the fraction of edges flipped as a measure of how much we have altered the network. In some cases, the network can be rearranged in a finite number of ways without changing its structure; these rearrangements are symmetries of the network, and the number of symmetries is also useful for characterizing the network. We find that changing the number of edges flipped in the network changes the rank of the covariance of a matrix consisting of the time series from the different nodes in the network, and speculate that this rank is important for understanding the reservoir computer performance.

**A reservoir computer is a high dimensional dynamical system that may be used for computations on time series signals. Usually the dynamical system is created by connecting a number of nonlinear nodes in a network that includes feedback. A time series input signal is coupled into the nodes of the reservoir computer and the time series outputs of the individual nodes are recorded. To train the reservoir computer, the node output time series are fit to some training time series that is related to the input signal, often with a least squares fit. The network itself is not changed. The fitting coefficients contain information about the relationship between the input signal and the training signal. We study several ways to characterize the structure of the network to see how the network structure affects the error in fitting the training signal. We find that if we choose a simple network where connections (or edges) between the nodes are all +1 or zero, we can produce simple changes in the network by flipping some of the edges to -1. For some networks, the network with edges flipped may contain symmetries; that is, the nodes may be rearranged without altering the network structure. In that case, the network may be characterized by the number of these symmetries. Otherwise, the network can be characterized by the fraction of edges flipped from +1 to -1. We hope that by understanding how the network structure affects the performance of the reservoir computer, we can build reservoir computers that produce more accurate solutions to problems.**

I Introduction

Reservoir computers were developed as a type of recurrent neural network by machine learning researchers Jaeger (2001); Natschlaeger et al. (2002), but they may also be described using the language of dynamical systems. An advantage of reservoir computers over other machine learning techniques is that training a reservoir computer is fast and simple; the reservoir itself, a network of nonlinear nodes, is kept fixed, and the time series responses from the nodes are combined using a linear weighted sum, where the weights are varied to fit a training signal.

Reservoir computers have been shown to be useful for solving a number of problems, including reconstruction and prediction of chaotic attractors Lu et al. (2018); Zimmermann and Parlitz (2018); Antonik et al. (2018); Lu et al. (2017); Jaeger and Haas (2004), recognizing speech, handwriting or other images Jalalvand et al. (2018) or controlling robotic systems Lukoševičius et al. (2012). One attractive feature of reservoir computers is that they may be implemented in a wide range of analog hardware, making them potentially very fast but with low power consumption. Examples of reservoir computers so far include photonic systems Larger et al. (2012); der Sande et al. (2017), analog circuits Schurmann et al. (2004), mechanical systems Dion et al. (2018) and field programmable gate arrays Canaday et al. (2018).

One obstacle to understanding what reservoir computers can or can’t do is that there is only a limited amount of theory on how reservoir computers function. Much of the theoretical work hinges on understanding the tradeoff between nonlinearity in the reservoir computer nodes and memory Jaeger (2002); Inubushi and Yoshimura (2017); Marzen (2017), where memory is described by the time dependent correlation between the reservoir computer input and the reservoir computer fit to delayed versions of this input signal. Other work focuses on generalized synchronization Lu et al. (2018); Lymburn et al. (2019). The previous work doesn’t describe how the choice of the network influences the reservoir computer, so in this work we study the effect of different networks on the reservoir computer.

Early on, Jaeger (2002) stated “If the network is suitably inhomogeneous, the various echo functions will significantly differ from each other”. It does seem obvious that a diverse network is necessary for an efficient reservoir computer, but how can this inhomogeneity be measured? A standard practice is to use a sparse random network. Random networks are easy to achieve in simulations, but the options for creating a network in a real analog system may be limited by experimental constraints. It would be interesting to confirm the standard wisdom using some measure of inhomogeneity, which could then be used to aid in the design of a network in a system where creating a completely random network might not be possible.

Characterizing a random network is difficult, so we use a network that is simple to characterize. All of the connections between nodes in our networks have the values $\pm 1$ or 0. Within these limits, the connections are chosen randomly. All networks are connected, which means for any two nodes, there is a path between them.

The random networks are initialized with all edges set to +1 or 0, and then some of the edges are flipped from +1 to -1. We will see that if we take two copies of the same initial network and flip the same number of edges but choose different edges to flip, the performance of the reservoir computer may be different. Therefore we choose many networks where the same number of edges are flipped and track trends in the behavior of the entire group of reservoir computers.

We will examine two different node types and two different input signals. The nodes are based on a nonlinear differential equation or a leaky hyperbolic tangent map Jaeger (2002), while the input signals will come from a Lorenz chaotic system or a nonlinear map acting on a random signal. The nonlinear mapping was chosen from a set of problems commonly used to test reservoir computers Rodan and Tino (2011).

In section II, we will describe reservoir computers and show how they may be trained. Section III describes our choice of how to feed signals into the reservoir computer, while section IV discusses how the network adjacency matrix may be characterized. All elements of the adjacency matrix, which describes the edges between nodes, start as +1 or 0, and some of the +1 edges are flipped to -1. The values of the edges are then normalized so that the maximum of the absolute value of the real part of the set of eigenvalues of the adjacency matrix is 0.5. We may then characterize the adjacency matrix by the number of symmetries it contains or by the fraction of the elements flipped from +1 to -1.

The main method we use to characterize reservoir computer performance is the testing error, described in section II, but we also use other methods to analyze the reservoir networks. These methods are described in section V. Section V.1 describes how symmetries in the network may be found, section V.2 lays out the computation of the rank of the covariance matrix for the reservoir, and section V.3 summarizes a common method for calculating memory capacity.

After laying out the analysis methods, section VI describes how we create the input signals from a Lorenz chaotic system or a random signal system. We then proceed to simulations of the reservoir computer testing error as we flip network edges in section VII. After first showing how the choice of input vector affects our results in section VII.1, we study the reservoir computer testing error as a function of the number of symmetries in the network (section VII.2) and as a function of the fraction of edges flipped (section VII.3). This section also includes measurements of the covariance matrix rank. The memory capacity for different networks is calculated in section VII.4. Finally, both fraction flipped and sparsity are varied in section VIII.

II Reservoir Computers

We used a reservoir computer to estimate one time series signal based on a different (but related) time series signal. Figure 1 is a block diagram of a reservoir computer. There is an input signal $s(t)$ from which the goal is to extract information, and a training signal $g(t)$ which is used to train the reservoir computer. In Lu et al. (2017) for example, $s(t)$ was the $x$ signal from a Lorenz chaotic system, while $g(t)$ was the Lorenz $z$ signal. The reservoir computer was trained to estimate the $z$ signal from the $x$ signal.

There are no specific requirements on the nodes in a reservoir computer, other than that when all nodes are connected into a network, the network should be stable; that is, it should settle into a stable fixed point. Commonly used nodes include hyperbolic tangent Manjunath and Jaeger (2013) or sigmoid functions Verstraeten et al. (2010), but in analog experiments the node nonlinearity is determined by the experimental system Larger et al. (2012); der Sande et al. (2017); Schurmann et al. (2004); Dion et al. (2018); Canaday et al. (2018). We select two different node types to decrease the chance that our results depend on the type of node used. One node type, the polynomial node, is chosen because a polynomial is a general way to represent a nonlinearity. The parameters for the polynomial node are chosen so that a network based on these nodes is stable. The second type of node is based on a sigmoid function. Sigmoid functions are common in neural network studies Hadaeghi and Jaeger (2019), although the form of our node is not the same as the most commonly used sigmoid function. The form of the nonlinearities in these two node types is different enough that our results should be general for different types of reservoir computers.

The polynomial reservoir computer is described by

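$$\frac{dr_{i}(t)}{dt}=\lambda\left[p_{1}r_{i}(t)+p_{2}r_{i}^{2}(t)+p_{3}r_{i}^{3}(t)+\sum_{j=1}^{M}A_{ij}r_{j}(t)+w_{i}s(t)\right]. \tag{1}$$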

The $r_{i}(t)$'s are node variables, $A$ is an adjacency matrix indicating how the nodes are connected to each other, and ${\bf W}=[w_{1},w_{2},\ldots,w_{M}]$ is a vector that describes how the input signal $s(t)$ is coupled to each node. The constant $\lambda$ is a time constant, and there are $M=100$ nodes. For the simulations described here, $p_{1}=-3$, $p_{2}=1$, $p_{3}=-1$, and $\lambda$ was set to minimize testing errors for different input signals. The parameters $p_{1}$, $p_{2}$ and $p_{3}$ were chosen to give a small value of the testing error, and they were kept the same for different input signals. The properties of $A$ will be varied to understand how the performance of the reservoir computer depends on the form of $A$.

Equation (1) was numerically integrated using a fourth-order Runge-Kutta integration routine with a time step of 0.1. Before driving the reservoir, the mean was subtracted from the input signal $s(t)$ and the input signal was normalized to have a standard deviation of 1.
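As an illustration, the integration of Eq. (1) might be sketched as follows. This is a minimal sketch and not the authors' code: the function names are hypothetical, the input is assumed to be held constant across each Runge-Kutta step, the node states are assumed to start at zero, and the default `lam=1.4` is the value the text reports later for the Lorenz input.

```python
import numpy as np

def polynomial_rhs(r, s_val, A, w, lam=1.4, p=(-3.0, 1.0, -1.0)):
    """Right-hand side of Eq. (1), evaluated for all nodes at once."""
    p1, p2, p3 = p
    return lam * (p1 * r + p2 * r**2 + p3 * r**3 + A @ r + w * s_val)

def run_polynomial_reservoir(s, A, w, lam=1.4, dt=0.1):
    """Integrate Eq. (1) with classical RK4; returns an array of shape (len(s), M).

    Assumptions: the input is held constant over each step (zero-order hold)
    and the node states start at zero.
    """
    s = np.asarray(s, dtype=float)
    s = (s - s.mean()) / s.std()      # zero mean, unit standard deviation
    r = np.zeros(A.shape[0])
    out = np.empty((len(s), A.shape[0]))
    for n, s_val in enumerate(s):
        k1 = polynomial_rhs(r, s_val, A, w, lam)
        k2 = polynomial_rhs(r + 0.5 * dt * k1, s_val, A, w, lam)
        k3 = polynomial_rhs(r + 0.5 * dt * k2, s_val, A, w, lam)
        k4 = polynomial_rhs(r + dt * k3, s_val, A, w, lam)
        r = r + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        out[n] = r
    return out
```

The returned array holds one row per time step and one column per node, which is the layout used for the reservoir signal matrix in the training step.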

The other reservoir computer used in this work is a map with nodes that implement a leaky tanh function. Following Jaeger (2002), the leaky tanh node computer is described by

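$$r_{i}(n+1)=\alpha r_{i}(n)+(1-\alpha)\tanh\left(\sum_{j=1}^{M}A_{ij}r_{j}(n)+w_{i}s(t)+1\right). \tag{2}$$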

Again, $s(t)$ was normalized to have a mean of 0 and a standard deviation of 1. As with the polynomial nodes, the parameters were chosen to give a small value of the testing error for a fixed adjacency matrix, and we use 100 nodes.
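The map of Eq. (2) is simple enough to iterate directly. The sketch below assumes zero initial node states and a hypothetical function name; the default `alpha=0.35` is the value the text gives later for the Lorenz input.

```python
import numpy as np

def run_leaky_tanh_reservoir(s, A, w, alpha=0.35):
    """Iterate the leaky tanh map of Eq. (2); returns an array of shape (len(s), M).

    Assumption: the node states start at zero.
    """
    s = np.asarray(s, dtype=float)
    s = (s - s.mean()) / s.std()      # zero mean, unit standard deviation
    r = np.zeros(A.shape[0])
    out = np.empty((len(s), A.shape[0]))
    for n, s_val in enumerate(s):
        # Leaky update: blend the old state with the tanh of the total drive.
        r = alpha * r + (1.0 - alpha) * np.tanh(A @ r + w * s_val + 1.0)
        out[n] = r
    return out
```

Because each update is a weighted average of the previous state and a tanh value, the node states stay inside $(-1,1)$ when started from zero.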

When the reservoir computer was driven with $s(t)$, the first 2000 time steps were discarded as a transient. The next $N=10000$ time steps from each node were combined in an $N\times(M+1)$ matrix

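$$\Omega=\begin{bmatrix}r_{1}(1)&\ldots&r_{M}(1)&1\\ r_{1}(2)&\ldots&r_{M}(2)&1\\ \vdots&&\vdots&\vdots\\ r_{1}(N)&\ldots&r_{M}(N)&1\end{bmatrix} \tag{3}$$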

The last column of $\Omega$ was set to 1 to account for any constant offset in the fit. The training signal is fit by

$$h(t)=\sum_{j=1}^{M}c_{j}r_{j}(t) \tag{4}$$

or

$$h(t)=\Omega{\bf C} \tag{5}$$

where ${h(t)}=\left[{h\left(1\right),h\left(2\right)\ldots h\left(N\right)}\right]$ is the fit to the training signal ${g(t)}=\left[{g\left(1\right),g\left(2\right)\ldots g\left(N\right)}\right]$ and ${{\bf C}}=\left[{{c_{1}},{c_{2}}\ldots{c_{N}}}\right]$ is the coefficient vector.

The matrix ${\Omega}$ is decomposed by a singular value decomposition

$$\Omega=USV^{T}, \tag{6}$$

where ${U}$ is $N\times(M+1)$ , ${S}$ is $(M+1)\times(M+1)$ with non-negative real numbers on the diagonal and zeros elsewhere, and ${V}$ is $(M+1)\times(M+1)$ .

The pseudo-inverse of ${\Omega}$ is constructed as a Moore-Penrose pseudo-inverse Penrose (1955)

$$\Omega_{inv}=VS'U^{T} \tag{7}$$

where $S'$ is an $(M+1)\times(M+1)$ diagonal matrix constructed from $S$, with diagonal elements $S'_{i,i}=S_{i,i}/(S_{i,i}^{2}+k^{2})$. Here $k=1\times 10^{-5}$ is a small number used for ridge regression Tikhonov (1943) to prevent overfitting. There are some guidelines for choosing $k$ Golub et al. (1979), but in this case $k$ is chosen large enough to keep the coefficients from becoming extremely large but small enough to keep the fitting error from becoming too large.

The fit coefficient vector is then found by

$${\bf C}=\Omega_{inv}\,g(t). \tag{8}$$

The training error may be computed from

$$\Delta_{RC}=\frac{\left\|\Omega{\bf C}-g(t)\right\|}{\left\|g(t)\right\|} \tag{9}$$

where $\left\|{}\right\|$ indicates a standard deviation.

The training error tells us how well the reservoir computer can fit a known training signal, but it doesn’t tell us anything we don’t already know. To learn new information, we use the reservoir computer in the testing configuration. As an example, suppose the input signal $s(t)$ was an $x$ signal from the Lorenz system, and the training signal $g(t)$ was the corresponding $z$ signal. Fitting the Lorenz $z$ signal trains the reservoir computer to reproduce the Lorenz $z$ signal from the Lorenz $x$ signal.

We may now use as an input signal $s^{\prime}(t)$ the Lorenz signal $x^{\prime}$ , which comes from the Lorenz system with different initial conditions. We want to get the corresponding $z^{\prime}$ signal. The matrix of signals from the reservoir is now $\Omega^{\prime}$ . The coefficient vector ${\bf C}$ is the same vector we found in the training stage. The testing error is

$$\Delta_{tx}=\frac{\left\|\Omega'{\bf C}-z'\right\|}{\left\|z'\right\|}. \tag{10}$$

The testing error measures how accurately the reservoir computer actually solves a problem.
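The training and testing steps of this section can be sketched in a few lines of NumPy. The function names are hypothetical and this is not the authors' implementation; it only follows the recipe in the text: append a column of ones, take a singular value decomposition, regularize the singular values as $S_{i,i}/(S_{i,i}^{2}+k^{2})$, and measure errors as a ratio of standard deviations.

```python
import numpy as np

def build_omega(states):
    """Append a column of ones to the (N, M) node time series, giving N x (M+1)."""
    return np.column_stack([states, np.ones(states.shape[0])])

def ridge_pinv(omega, k=1e-5):
    """Regularized pseudo-inverse V S' U^T with S'_ii = S_ii / (S_ii^2 + k^2)."""
    U, sv, Vt = np.linalg.svd(omega, full_matrices=False)
    s_prime = sv / (sv**2 + k**2)
    return (Vt.T * s_prime) @ U.T     # scales the columns of V by S'

def train(omega, g, k=1e-5):
    """Fit coefficient vector C from the training signal g."""
    return ridge_pinv(omega, k) @ g

def relative_error(omega, C, target):
    """Ratio of standard deviations, usable for both training and testing errors."""
    return np.std(omega @ C - target) / np.std(target)
```

In use, `train` would be called once on the training data, and `relative_error` would then be evaluated with the same `C` on a reservoir matrix produced by a new input signal. The regularized pseudo-inverse above is algebraically the same as the ridge solution $(\Omega^{T}\Omega+k^{2}I)^{-1}\Omega^{T}$.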

III The Input Coupling Vector ${\bf W}$

The coupling vector ${\bf W}=[w_{1},w_{2},\ldots,w_{M}]$ describes how the input signal $s(t)$ couples into each of the nodes. We want to look only at the effect of varying the coupling between nodes in the reservoir computer, so ${\bf W}$ is kept fixed. We have found that setting all the elements of ${\bf W}$ to the same value (all +1 or all -1) yields a larger reservoir computer testing error than setting the odd elements of ${\bf W}$ to +1 and the even elements of ${\bf W}$ to -1, so the second method (odd=+1, even=-1) was used. This choice was arbitrary, and other choices of ${\bf W}$ could be made. Below we show how the reservoir computer performs for our choice of input coupling vector compared to a random input vector.

IV Characterizing the Adjacency Matrix $A$

As described above, the reservoir contains 100 nodes, so the size of $A$ is $M\times M=100\times 100$ .

The diagonal elements of $A$ are all 0. Initially, all the off diagonal elements (network edges) of $A$ are set to +1 or 0. The initial network defined by $A$ is connected, meaning that for each pair of nodes there is a path between them. Different configurations of the network are created by flipping some of the edges between nodes from +1 to -1. The number of elements to be flipped, $N_{f}$ , is chosen and then the particular elements to be flipped are chosen randomly from all the elements that have the value +1 to give many realizations of the adjacency matrix for each value of $N_{f}$ . After the edges are flipped, the adjacency matrix is renormalized so that the absolute value of the largest real part of the matrix eigenvalues is 0.5.

Different networks with the same $N_{f}$ will reveal a range of testing errors for a fixed value of $N_{f}$ . For each $N_{f}$ value, the network is initialized to have the same adjacency matrix ${A}$ , with all edges equal to +1 or 0. For each $N_{f}$ , 20 different sets of $N_{f}$ edges from the nonzero network edges are randomly chosen to be flipped, and the testing error is calculated for each of the 20 different versions of the network.
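The construction described above might be sketched as follows. The function names and the random number handling are assumptions made for illustration, the default of 9800 nonzero edges is the count quoted later in the simulations section, and the sketch does not check that the network is connected.

```python
import numpy as np

def make_base_adjacency(M=100, n_edges=9800, rng=None):
    """Zero diagonal; n_edges randomly chosen off-diagonal entries set to +1."""
    rng = np.random.default_rng(rng)
    off_diag = np.flatnonzero(~np.eye(M, dtype=bool))   # flat indices off the diagonal
    A = np.zeros(M * M)
    A[rng.choice(off_diag, size=n_edges, replace=False)] = 1.0
    return A.reshape(M, M)

def flip_and_normalize(A_base, n_flip, rng=None, target=0.5):
    """Flip n_flip of the +1 edges to -1, then rescale the matrix so that the
    largest absolute real part of its eigenvalues equals target."""
    rng = np.random.default_rng(rng)
    A = A_base.copy()
    plus = np.flatnonzero(A == 1.0)
    chosen = rng.choice(plus, size=n_flip, replace=False)
    A.flat[chosen] = -1.0
    max_real = np.max(np.abs(np.linalg.eigvals(A).real))
    return A * (target / max_real)
```

Calling `flip_and_normalize` repeatedly with the same `A_base` and `n_flip` but different seeds gives the kind of ensemble described in the text, where 20 realizations are generated for each value of $N_{f}$.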

We choose to characterize the network by the fraction of the edges flipped, or $\varepsilon_{f}$ . For some values of $\varepsilon_{f}$ , there may be ways to permute the network nodes and their attached edges that leave the network unchanged. If this type of permutation is possible, we say the network contains symmetries, and we use the number of symmetries in place of $\varepsilon_{f}$ to characterize the network.

V Analysis Methods

Besides calculating the testing error for different reservoir computers, we analyze the reservoirs by the number of symmetries in the network, by the rank of the covariance matrix of the network and by the memory capacity of the network. We proceed to describe these different methods.

V.1 Symmetry

Symmetries in networks can have a dramatic effect on the dynamics. Here we use the concept of symmetry from graph theory Golubitsky et al. (1985), where a symmetry is a permutation of the nodes of the network, along with the edges attached to the nodes, that leaves the network unchanged. This is shown in Fig. 2. A simple 4-node network is shown in Fig. 2(a). The six symmetries are obvious. Along with the identity (no permutations), there are two rotations and three mirror symmetries.

Symmetries are easy to see with small networks, but with larger networks (7 or more nodes), the detection of symmetries is difficult and quickly becomes humanly impossible, as networks with more than 10 nodes can have millions or more symmetries (see Fig. 2(b)). There is however an algorithm Stein (2013) which can quickly determine the number of symmetries and give all possible permutation matrices from the matrix of connections. We use this algorithm here.

The reason symmetries can affect the dynamics of the network can be seen from the equations of motion. Let’s apply a symmetry permutation $P$ to the reservoir system. If ${\bf r}=(r_{1},\ldots,r_{M})$, then $P{\bf r}=(r_{\pi(1)},\ldots,r_{\pi(M)})$, where $\pi(i)$ is a permutation of $(1,\ldots,M)$ into a different order. That is, $P$ moves the components of $\bf r$ around into a different ordering. Note that if $P$ is a symmetry of the network, then the network coupling matrix $A$ (think of an adjacency matrix or Laplacian, for instance) must remain unchanged under the action of $P$, thus $PAP^{T}=A$, recalling that $P^{-1}=P^{T}$. This means $A$ and $P$ commute: $PA=AP$.
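For very small networks the condition $PAP^{T}=A$ can be checked by brute force, which may help make the definition concrete. The sketch below is an illustration only: it enumerates all $M!$ permutations, so it is not the algorithm of Stein (2013) and is hopeless for a 100-node network. The function names are hypothetical.

```python
import itertools
import numpy as np

def is_symmetry(A, perm):
    """True if relabeling the nodes by perm leaves A unchanged, i.e. P A P^T = A.

    With P[i, perm[i]] = 1, the entry (P A P^T)[i, j] equals A[perm[i], perm[j]].
    """
    perm = np.asarray(perm)
    return np.array_equal(A[np.ix_(perm, perm)], A)

def count_symmetries(A):
    """Count symmetries by testing every permutation; feasible only for small M."""
    M = A.shape[0]
    return sum(is_symmetry(A, p) for p in itertools.permutations(range(M)))
```

For a fully connected 3-node network with all edges equal to +1, this count is 6. Flipping a single directed edge to -1 leaves only the identity, which is a small-scale version of how edge flips remove symmetries.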

The equations of motion used in this paper are of the form (see Eqs.(1) and (2)),

$$\frac{d{\bf r}}{dt}={\bf F}({\bf r})+A{\bf H}({\bf r})+{\bf W}s(t), \tag{11}$$

where ${\bf F}({\bf r})$ is the node vector field, ${\bf H}({\bf r})$ is the coupling function, $A$ is the coupling matrix, and ${\bf W}s(t)$ is the weighted driving term. If the weights for the drive term are invariant under $P$ , i.e., $P{\bf W}={\bf W}$ and we apply a symmetry $P$ to Eq. (11) and recall that $A$ and $P$ commute and the functions ${\bf F}$ and ${\bf H}$ are the same for all nodes, we get,

$$\frac{dP{\bf r}}{dt}={\bf F}(P{\bf r})+A{\bf H}(P{\bf r})+{\bf W}s(t). \tag{12}$$

In other words, the permuted nodes $P{\bf r}$ have the same equation of motion as the original nodes. The consequence is that if the subsets of nodes that are permuted are started in a synchronized state ($r_{\pi(i)}(t_{0})=r_{i}(t_{0})$), they will remain synchronized. This dynamic is called flow invariance. The state where symmetry-related nodes synchronize is called cluster synchronization, where the nodes related by symmetry permutations are synchronized among themselves, but are not synchronized with nodes in other clusters Golubitsky et al. (1985); Pecora et al. (2014); Cho et al. (2017).
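The intermediate steps between Eq. (11) and Eq. (12) can be written out using only the conditions already stated: ${\bf F}$ and ${\bf H}$ act identically on every node (so they commute with a reordering of the components), $PA=AP$, and $P{\bf W}={\bf W}$. A sketch of the chain of equalities is

$$\frac{d(P{\bf r})}{dt}=P\frac{d{\bf r}}{dt}=P{\bf F}({\bf r})+PA{\bf H}({\bf r})+P{\bf W}s(t)={\bf F}(P{\bf r})+AP{\bf H}({\bf r})+{\bf W}s(t)={\bf F}(P{\bf r})+A{\bf H}(P{\bf r})+{\bf W}s(t).$$

The first equality holds because $P$ is a constant matrix, the third uses $P{\bf F}({\bf r})={\bf F}(P{\bf r})$ together with $PA=AP$ and $P{\bf W}={\bf W}$, and the last uses $P{\bf H}({\bf r})={\bf H}(P{\bf r})$.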

If this synchronized state is stable, then it is possible that the system will evolve into it. The dimension of the network will then be lower since multiple nodes will follow identical trajectories. Note that even if the symmetries are approximate, i.e., the components of $A$, the node dynamics ${\bf F}$, and/or the weights ${\bf W}$ vary only slightly from a symmetric case, there can still be approximate synchronization Sorrentino and Pecora (2016), where the trajectories of nodes related by symmetry permutations tend to closely follow an average trajectory. This still results in a reduction of dimension and complexity.

V.2 Covariance Rank of $\Omega$

We may also characterize the matrix of reservoir computer signals $\Omega$ . The individual columns of $\Omega$ will be used as a basis to fit the training signal $g(t)$ . The columns of $\Omega$ may be correlated with each other, so we would like to know the number of uncorrelated columns in $\Omega$ .

Principal component analysis Jolliffe (2011) states that the eigenvectors of the covariance matrix of $\Omega$, $\Theta=\Omega^{T}\Omega$, form an uncorrelated basis set. The rank of the covariance matrix tells us the number of uncorrelated vectors. Therefore, we will use the rank of the covariance matrix of $\Omega$,

$$\Gamma={\rm rank}\left(\Omega^{T}\Omega\right) \tag{13}$$

to characterize the reservoir matrix $\Omega$. We calculate the rank using the MATLAB rank() function, which returns the number of singular values above a certain threshold. The threshold is $\gamma_{r}=D_{\max}\,\delta(\sigma_{\max})$, where $D_{\max}$ is the largest dimension of $\Omega$ and $\delta(\sigma_{\max})$ is the difference between the largest singular value of $\Omega$ and the next largest double precision number.
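A NumPy version of this rank computation might look like the sketch below. It is an approximation of the procedure described in the text, not a port of the MATLAB function: the function name is hypothetical, the singular values are those of $\Theta=\Omega^{T}\Omega$, and the threshold is assumed to be the largest dimension of $\Omega$ times the floating point spacing at the largest of those singular values.

```python
import numpy as np

def covariance_rank(omega):
    """Number of singular values of Omega^T Omega above a MATLAB-style threshold.

    Assumption: threshold = (largest dimension of omega) * spacing(largest
    singular value), where spacing is the gap to the next double precision number.
    """
    theta = omega.T @ omega
    sv = np.linalg.svd(theta, compute_uv=False)
    threshold = max(omega.shape) * np.spacing(sv.max())
    return int(np.sum(sv > threshold))
```

If two nodes produce identical time series, the corresponding columns of $\Omega$ are linearly dependent and the rank returned here drops by one, which is the effect of synchronized clusters on $\Gamma$.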

V.3 Memory

Memory capacity, as defined in Jaeger (2002), is considered to be an important quantity in reservoir computers. Memory capacity is a measure of how well the reservoir can reproduce previous values of the input signal.

The memory capacity as a function of delay is

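$$MC_{k}=\frac{\sum_{n=1}^{N}\left[s(n-k)-\bar{s}\right]\left[g_{k}(n)-\bar{g}_{k}\right]}{\sum_{n=1}^{N}\left[s(n-k)-\bar{s}\right]\sum_{n=1}^{N}\left[g_{k}(n)-\bar{g}_{k}\right]} \tag{14}$$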

where the overbar indicates the mean. The signal $g_{k}(n)$ is the fit of the reservoir signals $r_{i}(n)$ to the delayed input signal $s(n-k)$. The memory capacity is

$$MC=\sum_{k=1}^{\infty}MC_{k}. \tag{15}$$

Input signals such as the Lorenz $x$ signal contain correlations in time, which will cause errors in the memory calculation, so in eq. (14), $s(n)$ is a random signal uniformly distributed between -1 and +1. The node parameters were optimized to minimize the testing error for fitting the reservoir signals to $s(n-1)$ . There are some drawbacks to defining memory in this way; the reservoir is nonlinear, so its response will be different for different input signals, and the node parameters for the memory calculation are not the same as for calculations with other input signals. Nevertheless, this memory definition is the standard definition used in the field of reservoir computing.
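A rough sketch of a memory capacity calculation is given below, under several assumptions that are not taken from the text: the infinite sum over delays is truncated at a hypothetical `max_delay`, each fit uses an ordinary least squares solve rather than the regularized pseudo-inverse, and $MC_{k}$ is computed as the squared correlation coefficient between $s(n-k)$ and its fit $g_{k}(n)$, which is the usual normalized form of Jaeger's definition.

```python
import numpy as np

def memory_capacity(omega, s, max_delay=50):
    """Return (MC_k for k = 1..max_delay, their sum).

    omega: (N, M+1) reservoir matrix whose row n holds the node values at time n.
    s:     length-N input signal that drove the reservoir.
    Assumptions: truncated sum over delays, plain least squares fits, and
    MC_k taken as a squared correlation coefficient.
    """
    s = np.asarray(s, dtype=float)
    N = len(s)
    mc_k = np.zeros(max_delay)
    for k in range(1, max_delay + 1):
        target = s[:N - k]        # s(n - k) for n = k, ..., N - 1
        features = omega[k:]      # reservoir rows at times n = k, ..., N - 1
        coeffs, *_ = np.linalg.lstsq(features, target, rcond=None)
        g_k = features @ coeffs
        if np.std(g_k) > 0:
            mc_k[k - 1] = np.corrcoef(target, g_k)[0, 1] ** 2
    return mc_k, mc_k.sum()
```

A useful sanity check is a "reservoir" that is just a delay line: if the columns of `omega` are copies of the input delayed by 1, 2 and 3 steps, then $MC_{k}$ should be close to 1 for those three delays and close to 0 for longer ones.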

VI Input signals

The first system we used to generate input and training signals is the Lorenz system Lorenz (1963)

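$$\begin{array}{l}\frac{dx}{dt}=c_{1}y-c_{1}x\\ \frac{dy}{dt}=x\left(c_{2}-z\right)-y\\ \frac{dz}{dt}=xy-c_{3}z\end{array} \tag{16}$$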

with $c_{1}=10$, $c_{2}=28$, and $c_{3}=8/3$. The equations were numerically integrated with a time step of $t_{s}=0.02$.
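The Lorenz signals could be generated along the following lines. The text gives the parameters and the time step but not the integrator, initial condition or transient length for this system, so the fourth-order Runge-Kutta scheme, the starting point `(1, 1, 1)` and the 1000-step transient below are assumptions, as are the function names.

```python
import numpy as np

def lorenz_rhs(v, c1=10.0, c2=28.0, c3=8.0 / 3.0):
    """Right-hand side of the Lorenz equations for the state v = (x, y, z)."""
    x, y, z = v
    return np.array([c1 * (y - x), x * (c2 - z) - y, x * y - c3 * z])

def lorenz_signals(n_steps, dt=0.02, v0=(1.0, 1.0, 1.0), n_transient=1000):
    """Integrate with classical RK4 and return an (n_steps, 3) array of x, y, z.

    Assumptions: RK4 integrator, initial condition v0, and a discarded transient.
    """
    v = np.array(v0, dtype=float)
    out = np.empty((n_steps, 3))
    for n in range(n_transient + n_steps):
        k1 = lorenz_rhs(v)
        k2 = lorenz_rhs(v + 0.5 * dt * k1)
        k3 = lorenz_rhs(v + 0.5 * dt * k2)
        k4 = lorenz_rhs(v + dt * k3)
        v = v + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        if n >= n_transient:
            out[n - n_transient] = v
    return out
```

For the task described earlier, the first column would serve as the input signal $s(t)$ and the third column as the training signal $g(t)$; a second call with a different `v0` would supply the testing signals.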

The second system is a nonlinear map acting on a random signal, taken from Rodan and Tino (2011)

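$$\begin{array}{l}x(k)={\rm random}\left[0,0.5\right]\\ y(k+1)=0.3y(k)+0.05y^{2}(k)+1.5x^{2}(k)+0.1\end{array} \tag{17}$$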

This system is commonly used as a test of the ability of a reservoir computer to fit a signal.
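Generating this pair of signals takes only a short loop. In the sketch below, "random $[0,0.5]$" is assumed to mean a uniform distribution on that interval, and the initial value `y0=0.0` and the function name are hypothetical choices.

```python
import numpy as np

def nonlinear_map_signals(n_steps, y0=0.0, rng=None):
    """Return (x, y): x uniform on [0, 0.5) and y driven by x through the map.

    Assumptions: uniform distribution for x and initial value y0.
    """
    rng = np.random.default_rng(rng)
    x = rng.uniform(0.0, 0.5, size=n_steps)
    y = np.empty(n_steps)
    y[0] = y0
    for k in range(n_steps - 1):
        y[k + 1] = 0.3 * y[k] + 0.05 * y[k] ** 2 + 1.5 * x[k] ** 2 + 0.1
    return x, y
```

Here `x` plays the role of the input signal and `y` the training signal; an initial stretch would normally be discarded so that the choice of `y0` does not matter.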

VII Simulations: Flipping Network Edges

Because the reservoir is nonlinear, changing the adjacency matrix can have a complicated effect on the testing error. In order to get good statistics, each time the number of network edges $N_{f}$ to be flipped was chosen, 20 different adjacency matrices were generated with $N_{f}$ randomly flipped edges. The graphs below show all 20 values of the testing error for each number of edges flipped.

We begin with a $100\times 100$ adjacency matrix with 9800 of the network edges equal to +1. The large number of nonzero edges made it more likely that the network would have symmetries. Once the number of nonzero edges was chosen, the specific nonzero edges were chosen randomly until a network with a large number of symmetries was found. All of the diagonal elements are 0.

If the number of nonzero network edges is $N_{1}$ , then the fraction of edges flipped is $\varepsilon_{f}=N_{f}/N_{1}$ . For some values of $\varepsilon_{f}$ , the network contains symmetries (see section V.1). For networks that contain symmetries, the network will be characterized by the number of symmetries, $\zeta_{s}$ . When no edges were flipped, the network contained $9.2678\times 10^{51}$ symmetries, calculated using the methods from Stein (2013). For networks that contain only one symmetry, the identity, the network will be characterized by the fraction of elements flipped, $\varepsilon_{f}$ .

VII.1 Comparison to other networks

The network we choose appears to be specialized, so we must ask if the results will apply to other reservoir computers. As an alternative to our configuration, we simulated reservoir computers where

1. 98% of the edges were $\pm 1$ and the elements of the input vector ${\bf W}$ alternated between +1 and -1;
2. 98% of the edges were $\pm 1$ and the elements of ${\bf W}$ were all +1;
3. 98% of the edges were $\pm 1$ and the elements of ${\bf W}$ were chosen from a random uniform distribution between $\pm 1$; and
4. 20% of the network edges were nonzero, chosen from a uniform distribution between $\pm 1$, and the elements of ${\bf W}$ were chosen from a random uniform distribution between $\pm 1$.

The last choice is typical of reservoir computer simulations that have been published.

Figure 3 shows the testing error $\Delta_{tx}$ as a function of fraction of edges flipped $\varepsilon_{f}$ for the first three cases. The fourth situation, where the input vector and network were random, is also plotted on the same axes for comparison. When all the elements of the input vector are +1, the testing error is much larger than when the elements are $\pm 1$ . Choosing elements of the input vector from a random distribution gives a smaller testing error when $\varepsilon_{f}$ is small, but for larger values of $\varepsilon_{f}$ the testing errors are in the same range as when all elements are $\pm 1$ . When the network edges and input vector elements are all chosen from a random distribution, the testing error is in the same range as when $\varepsilon_{f}=0.5$ .

Figure 3 does show that the standard practice of choosing all network edges randomly does give the best result, but when the fraction of edges flipped is 50%, our networks of all $\pm 1$ work just as well, so while networks need some randomness, they do not have to be completely random.

Figure 3 indicates that just using a completely random network is a good choice, but there may be restrictions on choosing network edges in experimental situations, so it is important to know how to create good networks that are not random. Finally, without a theory, the only way to know if a network choice is optimum is to simulate different networks and measure their performance.

VII.2 Testing error vs. Number of Symmetries

VII.2.1 Polynomial Nodes

These simulations used a reservoir computer with the polynomial nodes (eq. 1). The value of $\lambda$ was chosen to minimize testing error by matching the time scale of the reservoir to the time scale of the input signal. Figure 4 shows the log base 10 of testing error $\Delta_{tx}$ as a function of the log base 10 of the number of symmetries $\zeta_{s}$ for a reservoir computer with polynomial nodes when the input signal $s(t)$ is the Lorenz $x$ signal and the training signal $g(t)$ is the Lorenz $z$ signal. For this combination of input and training signals, $\lambda=1.4$ .

Figure 5 shows the testing error for the reservoir computer with polynomial nodes when the input signal $s(t)$ is the nonlinear map signal $x(k)$ (eq. 17) and the training signal $g(t)$ is the nonlinear map signal $y(k)$ . The value of $\lambda$ was set to 5 to minimize the testing errors.

Comparing figures 4 and 5, in both cases the testing error $\Delta_{tx}$ increases as the number of symmetries $\zeta_{s}$ increases, so that when there are more ways that the individual nodes can be rearranged without changing the network structure, the testing error is larger.
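To make the notion of a symmetry concrete, the sketch below counts the node permutations that leave a small adjacency matrix unchanged. It is a brute-force illustration only (the function name `count_symmetries` is hypothetical); it is not the method of Stein (2013) used for the networks in this paper, and it is feasible only for a handful of nodes.

```python
import itertools
import numpy as np

def count_symmetries(A):
    """Count node permutations p with A[p[i], p[j]] == A[i, j] for all i, j.

    Brute force over all M! permutations, so only usable for very small M.
    """
    M = A.shape[0]
    count = 0
    for p in itertools.permutations(range(M)):
        idx = np.array(p)
        # Reorder rows and columns by the same permutation and compare.
        if np.array_equal(A[np.ix_(idx, idx)], A):
            count += 1
    return count

A = np.ones((4, 4))
n_all = count_symmetries(A)     # every permutation is a symmetry
A[0, 1] = -1                    # flip a single edge
n_one = count_symmetries(A)     # only permutations fixing nodes 0 and 1 remain
```

In this toy case the all-ones matrix has $4! = 24$ symmetries, and flipping one edge leaves 2, which mirrors the way flipping edges removes symmetries in the larger networks.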

VII.2.2 Leaky Tanh Nodes

The reservoir computer with leaky tanh nodes (eq. 2) follows the same trends as the reservoir computer with polynomial nodes, but some of the details are different. When the input signal was the Lorenz $x$ signal, the parameter $\alpha$ in eq. (2) was set to 0.35 and the spectral radius was set to 1.0. These parameters were found by minimizing the testing error.

Figure 7 shows the testing error when the input signal is the nonlinear map signal $x(k)$ (eq. 17) and the training signal $g(t)$ is the nonlinear map signal $y(k)$ . In this case, the log-log plot of testing error vs. number of symmetries is not linear; we do not know why.

VII.3 Testing error vs. Fraction of Edges Flipped

If more than 100 edges were flipped in the network used in the previous section, we could find only one symmetry, the identity. For larger numbers of flipped edges, we plot the testing error $\Delta_{tx}$ as a function of the fraction of edges flipped from +1 to -1, $\varepsilon_{f}$ .

VII.3.1 Polynomial and Linear Nodes

In this section we consider two types of node. We study the polynomial nodes of eq. (1) and a reservoir computer with linear nodes. The linear nodes are also described by eq. (1), but with parameters $p_{1}=-3$ , $p_{2}=0$ and $p_{3}=0$ . The difference in the testing error $\Delta_{tx}$ will reveal how the reservoir computer performance depends on nonlinearity.
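The relation between the two node types can be sketched by writing the right-hand side of eq. (1) as one function and switching the polynomial terms off. The function name `node_rhs`, the small random network, the input vector and the value of `lam` are placeholders for illustration only.

```python
import numpy as np

def node_rhs(r, s, A, W, lam, p1, p2, p3):
    """Right-hand side of eq. (1) evaluated for all M nodes at once.

    r: (M,) node states, s: scalar input s(t), A: (M, M) adjacency matrix,
    W: (M,) input vector, lam: time-scale parameter lambda.
    """
    return lam * (p1 * r + p2 * r**2 + p3 * r**3 + A @ r + W * s)

rng = np.random.default_rng(1)
M = 5
A = rng.choice([-1.0, 0.0, 1.0], size=(M, M))   # placeholder network
W = np.ones(M)                                   # placeholder input vector
r = rng.standard_normal(M)

# Linear nodes: p1 = -3, p2 = p3 = 0, as stated in the text.
lin = node_rhs(r, 0.5, A, W, lam=1.4, p1=-3.0, p2=0.0, p3=0.0)
lin_doubled = node_rhs(2 * r, 1.0, A, W, lam=1.4, p1=-3.0, p2=0.0, p3=0.0)
# For linear nodes, doubling both the state and the input doubles the derivative.
```

With $p_{2}=p_{3}=0$ the derivative is a linear function of the node states and the input, so the reservoir acts as a linear filter of $s(t)$; any nonzero $p_{2}$ or $p_{3}$ breaks this superposition property.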

Figure 8 shows the testing error $\Delta_{tx}$ vs. the fraction of edges flipped $\varepsilon_{f}$ for a reservoir computer when the nodes were described by the polynomial of eq. (1). The input signal $s(t)$ was the Lorenz $x$ signal, while the training signal $g(t)$ was the Lorenz $z$ signal. The figure also shows the testing error when the nodes were linear, that is $p_{1}=-3$ , $p_{2}=0$ and $p_{3}=0$ .

As a larger fraction of edges in the network are flipped from +1 to -1, the testing error plotted in figure 8 decreases if the nodes are polynomial nodes. If the nodes are linear, figure 8 shows that the testing error $\Delta_{tx}$ does not decrease. Nonlinearity is necessary for the reservoir computer to fit the Lorenz $z$ signal when the input signal is the Lorenz $x$ signal.

A possible explanation for why flipping more edges in the network from +1 to -1 reduces the testing error $\Delta_{tx}$ is shown in figure 9, which shows the covariance rank $\Gamma$ (defined in eq. 13) of the reservoir variables ${\bf R}(t)$ as a function of fraction of edges flipped, $\varepsilon_{f}$ .

When the nodes are polynomial nodes, figure 9 shows that the rank $\Gamma$ of the covariance of the reservoir matrix $\Omega$ increases with the fraction of edges flipped, while the rank when the nodes are linear increases only slightly, with a maximum rank of 6. The linear reservoir variables span a much lower dimension than the polynomial reservoir variables, which may be why the polynomial reservoir does a better job of fitting the Lorenz $z(t)$ signal in figure 8.

When the input signal $s(t)$ for the reservoir comes from the random $x(k)$ signal from eq. (17) and the training signal is the $y(k)$ signal, figure 10 shows the testing error $\Delta_{tx}$ as a function of fraction of edges $\varepsilon_{f}$ in the network flipped from +1 to -1, for both polynomial nodes and linear nodes.

Figure 11 shows the covariance rank $\Gamma$ as a function of fraction of edges flipped, $\varepsilon_{f}$ when the reservoir input signal is the random $x(k)$ signal from eq. (17) and the training signal is the $y(k)$ signal.

As in the Lorenz case, when the reservoir is driven by the nonlinear map system the testing error $\Delta_{tx}$ decreases as the fraction of edges flipped $\varepsilon_{f}$ increases for polynomial nodes, but not for linear nodes. Figure 11 shows that once again, the covariance rank $\Gamma$ increases with the fraction of edges flipped for polynomial nodes, but increases only slightly for the linear nodes. The rank actually saturates at 100 for the polynomial nodes. The signal $x(k)$ is a random signal, so it makes sense that the reservoir ${\bf R}(t)$ would have a higher covariance rank when driven with the infinite-dimensional random signal than when driven by the finite-dimensional Lorenz signal.

VII.3.2 Leaky Tanh Nodes

The testing error $\Delta_{tx}$ as a function of fraction of edges flipped $\varepsilon_{f}$ when the input signal $s(t)$ for the reservoir with leaky tanh nodes was the Lorenz $x$ signal and the training signal $g(t)$ was the Lorenz $z$ signal is shown in figure 12.

The reservoir computer testing error in figure 12 shows a decreasing trend as the fraction of edges flipped increases, but the testing error increases between $\varepsilon_{f}=0.3$ and 0.4. The training error, also shown in figure 12, does not increase over this range. One possible reason for the larger testing error is that the reservoir may be less stable over this region of $\varepsilon_{f}$, so it may be more sensitive to differences in the input signal.

Figure 13 shows the covariance rank $\Gamma$ as a function of fraction of edges flipped. The covariance rank saturates at its highest possible value for $\varepsilon_{f}>0.4$ .

Comparing figure 8 to figure 12, the testing error for the reservoir using leaky tanh nodes is higher than the testing error using polynomial nodes (when driven by the Lorenz system), even though the covariance rank for the leaky tanh nodes (figure 13) is higher than for the polynomial nodes (figure 9). Clearly a higher rank is related to a lower testing error only if the node type stays the same.

Figure 14 shows the testing error $\Delta_{tx}$ vs. the fraction of network edges $\varepsilon_{f}$ that have been flipped from +1 to -1 when the reservoir computer with leaky tanh nodes is driven by the random $x(k)$ signal from eq. (17).

As with the Lorenz input signal, when the input signal comes from the random system, the reservoir covariance rank is higher for the leaky tanh nodes but the testing error is also higher. Higher covariance rank leads to lower error when a particular input signal is used with a particular node type, but the same is not true for different combinations of node and input signal.

VII.3.3 Arbitrary Cutoff for Rank Calculation

In the previous sections, the rank of the covariance matrix was calculated using the MATLAB rank function. This function calculates the rank as the number of singular values above a tolerance of $\gamma_{r}=D_{\max}\,\delta(\sigma_{\max})$, where $D_{\max}$ is the largest dimension of $\Omega$ and $\delta(\sigma_{\max})$ is the difference between the largest singular value of $\Omega$ and the next largest double precision number. For the covariance matrices in the previous sections, $\gamma_{r}=1.42\times 10^{-12}$ for all parameters.
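The tolerance rule described here can be reproduced in NumPy as a rough sketch. The function name `matlab_style_rank` is hypothetical; `np.spacing` returns the gap between a floating-point number and the next representable one, which plays the role of $\delta(\sigma_{\max})$.

```python
import numpy as np

def matlab_style_rank(Omega):
    """Count singular values above gamma_r = D_max * delta(sigma_max).

    Returns (rank, gamma_r). Sketch of the tolerance rule quoted in the
    text, not a call into MATLAB itself.
    """
    s = np.linalg.svd(Omega, compute_uv=False)   # singular values
    gamma_r = max(Omega.shape) * np.spacing(s.max())
    return int(np.sum(s > gamma_r)), gamma_r

# Toy example: a diagonal matrix with two nonzero singular values.
rank, gamma_r = matlab_style_rank(np.diag([3.0, 2.0, 0.0, 0.0]))
```

Because the tolerance scales with the largest singular value, it adapts to the overall magnitude of the matrix rather than being a fixed absolute cutoff.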

We may see how robust these rank results are by setting a different tolerance. Figure 16(a) replots the covariance ranks from the previous figures using the MATLAB tolerance of $\gamma_{r}=1.42\times 10^{-12}$ , while figure 16(b) shows the number of singular values for the covariance matrix above the arbitrary threshold of $1\times 10^{-6}$ times the largest singular value. This number is designated as ${\rm SV}_{1e-6}$ .
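The alternative count ${\rm SV}_{1e-6}$ differs only in the cutoff: it is relative to the largest singular value instead of being tied to floating-point precision. A minimal sketch, with the hypothetical function name `count_sv_above`:

```python
import numpy as np

def count_sv_above(Omega, rel_threshold=1e-6):
    """Number of singular values larger than rel_threshold * sigma_max."""
    s = np.linalg.svd(Omega, compute_uv=False)
    return int(np.sum(s > rel_threshold * s.max()))

# Toy example: singular values 1, 1e-3, 1e-8 and 0.
sv_count = count_sv_above(np.diag([1.0, 1e-3, 1e-8, 0.0]))
```

In this toy example the singular value of $10^{-8}$ falls below the relative cutoff and is not counted, although a precision-based tolerance would still count it. This is why the two panels of figure 16 can order the curves differently.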

The ordering of the different curves in figure 16 is not the same for parts (a) and (b), but all curves do increase as the fraction of edges flipped, $\varepsilon_{f}$ , increases.

VII.4 Memory Capacity

Figure 17 shows that the memory capacity, as defined by eq. (15), increases as the fraction of edges flipped increases. Theory has suggested that longer memory leads to improved computational accuracy in a reservoir computer (Jaeger (2002); Marzen (2017)). For individual node types, higher memory capacity does correlate with lower testing error. Across node types the correlation fails: figure 17 shows that the memory capacity for the leaky tanh nodes was higher than for the polynomial nodes, yet the leaky tanh nodes gave larger testing errors. It is interesting to note that the rank for all four combinations of input signal and node type stops increasing for $\varepsilon_{f}>0.4$, and the memory capacity also levels off above this value.
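A delay-based memory capacity in the spirit of Jaeger (2002) can be sketched as below. This is an assumption about the general form only: it sums the squared correlation between the delayed input $s(t-k)$ and its best linear fit from the reservoir signals, and it may differ in detail from eq. (15). The function name, the constant offset column and the toy delay-line "reservoir" are all illustrative.

```python
import numpy as np

def memory_capacity(R, s, max_delay):
    """Sum over delays k of the squared correlation between s(t-k) and its
    least-squares fit from the reservoir states R(t).

    R: (T, M) reservoir signals, s: (T,) input signal.
    """
    T = len(s)
    X = np.column_stack([R, np.ones(T)])       # reservoir signals plus offset
    total = 0.0
    for k in range(1, max_delay + 1):
        target = s[:T - k]                     # s(t - k)
        Xk = X[k:]                             # reservoir at time t
        coef, *_ = np.linalg.lstsq(Xk, target, rcond=None)
        fit = Xk @ coef
        total += np.corrcoef(fit, target)[0, 1] ** 2
    return total

# Toy check: a 3-node delay line that stores exactly the last 3 inputs.
rng = np.random.default_rng(2)
T = 2000
s = rng.uniform(-1, 1, T)
R = np.zeros((T, 3))
for j in range(3):
    R[j + 1:, j] = s[:T - j - 1]               # node j holds s(t - j - 1)
mc = memory_capacity(R, s, max_delay=5)
```

For the delay line the result should come out close to 3, since only delays 1 to 3 can be reconstructed. The fit is evaluated on the same data it was trained on, which slightly inflates the estimate.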

VIII Varying Sparsity and Fraction Flipped

A typical assumption for reservoir computers is that a sparse adjacency matrix is necessary to achieve the diversity of signals necessary for low training error. We test this assumption by varying both sparsity, defined as the fraction of network edges that are not zero, and the fraction of edges flipped $\varepsilon_{f}$ . Sparsity will be denoted as $\phi$ .
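A network with a prescribed sparsity $\phi$ and fraction flipped $\varepsilon_{f}$ can be generated as in the sketch below. The function name `make_network`, the random placement of the nonzero edges and the inclusion of diagonal entries among the possible edges are assumptions made for illustration.

```python
import numpy as np

def make_network(M, phi, eps_f, rng):
    """M x M matrix in which a fraction phi of the entries are nonzero
    (magnitude 1) and a fraction eps_f of those are -1 instead of +1."""
    A = np.zeros(M * M)
    n_nonzero = int(round(phi * M * M))        # number of nonzero edges
    n_flip = int(round(eps_f * n_nonzero))     # number of edges set to -1
    edges = rng.choice(M * M, size=n_nonzero, replace=False)
    A[edges] = 1.0
    A[edges[:n_flip]] = -1.0                   # edges is already in random order
    return A.reshape(M, M)

rng = np.random.default_rng(3)
A = make_network(20, phi=0.25, eps_f=0.3, rng=rng)
```

Calling such a generator on a grid of $(\phi, \varepsilon_{f})$ values and evaluating the reservoir for each network gives the data needed for contour plots of the kind described next.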

The top plot in figure 18 is a contour plot of the log base 10 testing error $\Delta_{tx}$ as the fraction of edges flipped and the sparsity are varied. The lower plot is the covariance rank $\Gamma$ . The plots in figure 18 show that the testing error gets smaller and covariance rank gets larger as the sparsity decreases, so choosing sparse networks can be a useful goal. These plots do show, however, that the influence of the number of edges flipped is much stronger. Figure 19 shows the same information for the polynomial nodes driven by the random signal $x(k)$ from eq. (17).

Figures 20 and 21 are contour plots for a network of leaky tanh nodes being driven by the Lorenz $x$ signal or the random signal $x(k)$ from eq. (17). These two figures show an asymmetry along the $\varepsilon_{f}$ axis, especially for the random driving signal. The equation for the leaky tanh nodes contains an offset of 1.0, so the equations are not symmetric about zero. Changing most of the network edges from +1 to -1 can produce a bias in the network signals, which will affect the dynamics because of the asymmetry in the leaky tanh node equation.

IX Conclusions

We have simulated reservoir computers where the edges between the reservoir computer nodes were all +1 or 0, and then changed the reservoir computer network by flipping some of these edges from +1 to -1. We have done this for different combinations of node type and input signals.

If a small fraction of the edges of the adjacency matrix were flipped from +1 to -1, the number of symmetries in the adjacency matrix could function as a similarity measure for the adjacency matrix. Adjacency matrices that had more symmetries led to reservoirs that had a lower covariance rank and larger errors in fitting a training signal.

When the only symmetry in the adjacency matrix was the identity, a different measure of the variation in the adjacency matrix was necessary. Because the adjacency matrices used in this work were simple, we could use the fraction $\varepsilon_{f}$ of elements flipped from +1 to -1 as a measure of this variation. Increasing $\varepsilon_{f}$ increased the rank of the covariance of the reservoir matrix, which in most cases led to a smaller error in fitting a training signal. Comparing to completely random networks, we found that our networks, with all edges $\pm 1$, produced as small a testing error as the completely random network.

We also investigated the relation between the fraction of edges flipped and memory capacity. We quantified memory capacity using the method of Jaeger (2002), where memory is measured by finding how well the reservoir can fit previous values of a random input signal. We found that within a particular node type, memory capacity, testing error and covariance rank were all correlated.

Studying testing error and covariance rank showed that having fewer nonzero edges in the network (lower sparsity) did produce a smaller testing error, but the effect of sparsity was not as strong as the effect of flipping more network edges.

We did not investigate the effect of different nonrandom network statistics on the behavior of the reservoir computer. Networks whose connections are not all $\pm 1$ will require different measures of diversity. There are other types of network statistics such as nearest neighbor networks, star, or ring networks, or networks with different weights for different connections. All of these types of networks may perform differently as their parameters are changed.

X References


1. Jaeger (2001). H. Jaeger, German National Research Center for Information Technology GMD Technical Report 148, 34 (2001).
2. Natschlaeger et al. (2002). T. Natschlaeger, W. Maass, and H. Markram, Special Issue on Foundations of Information Processing of TELEMATIK 8, 39 (2002).
3. Lu et al. (2018). Z. Lu, B. R. Hunt, and E. Ott, Chaos: An Interdisciplinary Journal of Nonlinear Science 28, 061104 (2018).
4. Zimmermann and Parlitz (2018). R. S. Zimmermann and U. Parlitz, Chaos: An Interdisciplinary Journal of Nonlinear Science 28, 043118 (2018).
5. Antonik et al. (2018). P. Antonik, M. Gulina, J. Pauwels, and S. Massar, Physical Review E 98, 012215 (2018).
6. Lu et al. (2017). Z. Lu, J. Pathak, B. Hunt, M. Girvan, R. Brockett, and E. Ott, Chaos: An Interdisciplinary Journal of Nonlinear Science 27, 041102 (2017).
7. Jaeger and Haas (2004). H. Jaeger and H. Haas, Science 304, 78 (2004).
8. Jalalvand et al. (2018). A. Jalalvand, K. Demuynck, W. D. Neve, and J.-P. Martens, Neurocomputing 277, 237 (2018).
