Spatiotemporal Local Propagation

Alessandro Betti; Marco Gori

arXiv:1907.05106·cs.LG·July 12, 2019

TL;DR

This paper introduces SpatioTemporal Local Propagation (STLP), a biologically plausible neural computation scheme based on variational principles, which operates locally in space and time without requiring backpropagation.

Contribution

It presents a novel variational framework for neural networks that achieves spatial and temporal locality, addressing biological plausibility issues of traditional backpropagation methods.

Findings

1. STLP does not require backpropagation of errors.
2. The scheme is local in both space and time.
3. It surpasses BPTT and RTRL in biological plausibility.

Abstract

This paper proposes an in-depth re-thinking of neural computation that parallels apparently unrelated laws of physics, that are formulated in the variational framework of the least action principle. The theory holds for neural networks that are also based on any digraph, and the resulting computational scheme exhibits the intriguing property of being truly biologically plausible. The scheme, which is referred to as SpatioTemporal Local Propagation (STLP), is local in both space and time. Space locality comes from the expression of the network connections by an appropriate Lagrangian term, so as the corresponding computational scheme does not need the backpropagation (BP) of the error, while temporal locality is the outcome of the variational formulation of the problem. Overall, in addition to conquering the often invoked biological plausibility missed by BP, the locality in both space…

Tables

Table 1: Links between learning theory and classical mechanics.

Learning	Mechanics	Remarks
$(W, x)$	$u$	Weights and neuronal outputs are interpreted as generalized coordinates.
$(\dot{W}, \dot{x})$	$\dot{u}$	Weight variations and neuronal variations are interpreted as generalized velocities.
$𝒜 (x, W)$	$𝒮 (u)$	The cognitive action is the dual of the action in mechanics.

Equations

$$\mathscr{S}(u):=\frac{1}{2}\int_{\Omega}|\nabla u(x)|^{2}\,\varpi(x)\,dx+\mathscr{F}(u),\tag{1}$$

$$\operatorname{rank}G_{z}=r\quad\text{for all }z\in{\bf R}^{N}.\tag{2}$$

$$\mathscr{S}^{*}(u)=\int_{\Omega}\Bigl(\frac{1}{2}|\nabla u(x)|^{2}\,\varpi(x)-\lambda_{j}(x)G^{j}(u(x))\Bigr)\,dx+\mathscr{F}(u),\tag{3}$$

$$-\varpi\Delta u-u_{x^{\alpha}}\varpi_{x^{\alpha}}-\lambda_{\ell}G_{z}^{\ell}(u)+L_{F}(u)=0,\tag{4}$$

$$-\Delta u\cdot G_{z}^{j}(u)=G^{j}_{z^{i}z^{k}}(u)\,u^{i}_{x^{\alpha}}u^{k}_{x^{\alpha}},\qquad 1\leq j\leq r.\tag{5}$$

$$A_{j\ell}(u)\lambda_{\ell}=\varpi\,G^{j}_{z^{i}z^{k}}(u)\,u^{i}_{x^{\alpha}}u^{k}_{x^{\alpha}}-\varpi_{x^{\alpha}}\bigl(u_{x^{\alpha}}\cdot G_{z}^{j}(u)\bigr)+L_{F}(u)\cdot G_{z}^{j}(u),\tag{6}$$

$$\lambda_{\ell}=(A^{-1}(u))_{\ell j}\bigl(\varpi\,G^{j}_{z^{i}z^{k}}(u)\,u^{i}_{x^{\alpha}}u^{k}_{x^{\alpha}}-\varpi_{x^{\alpha}}(u_{x^{\alpha}}\cdot G^{j}_{z}(u))+L_{F}(u)\cdot G_{z}^{j}(u)\bigr).\tag{7}$$

$$G^{j}(\tau,\xi,M):=\begin{cases}\xi^{j}-e^{j}(\tau),&\text{if }1\leq j\leq\omega;\\ \xi^{j}-\sigma(m_{jk}\xi^{k}),&\text{if }\omega<j\leq\nu,\end{cases}\tag{8}$$

[Figure 1, panels (a) and (b): images not available.]

$$\mathscr{A}(x,W):=\int\frac{1}{2}\bigl(m_{x}|\dot{x}(t)|^{2}+m_{W}|\dot{W}(t)|^{2}\bigr)\varpi(t)\,dt+\mathscr{F}(x,W),\tag{9}$$

$$G^{j}(t,x(t),W(t))=0,\qquad 1\leq j\leq\nu.\tag{10}$$

$$G^{j}_{\xi^{i}}(\tau,\xi,M)=\begin{cases}\delta_{ij},&\text{if }1\leq j\leq\omega;\\ \delta_{ij}-\sigma'(m_{jk}\xi^{k})\,m_{ji},&\text{if }\omega<j\leq\nu,\end{cases}$$

$$\bigl(G^{j}_{\xi^{i}}(\tau,\xi,M)\bigr)=\begin{pmatrix}1&*&\cdots&*\\ 0&1&\cdots&*\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&1\end{pmatrix},$$

$$G^{j}(\tau,\xi,M,\zeta):=\begin{cases}\xi^{j}-e^{j}(\tau)+\zeta^{j},&\text{if }1\leq j\leq\omega;\\ \xi^{j}-\sigma(m_{jk}\xi^{k})+\zeta^{j},&\text{if }\omega<j\leq\nu.\end{cases}\tag{11}$$

$$\mathscr{A}(x,W,s):=\int\frac{1}{2}\bigl(m_{x}|\dot{x}(t)|^{2}+m_{W}|\dot{W}(t)|^{2}+m_{s}|\dot{s}(t)|^{2}\bigr)\varpi(t)\,dt+\mathscr{F}(x,W,s),\tag{12}$$

$$\mathscr{A}^{*}(x,W)=\int\Bigl(\frac{1}{2}\bigl(m_{x}|\dot{x}(t)|^{2}+m_{W}|\dot{W}(t)|^{2}\bigr)\varpi(t)-\lambda_{j}(t)G^{j}(t,x(t),W(t))\Bigr)\,dt+\mathscr{F}(x,W),\tag{13}$$

$$-m_{x}\varpi(t)\ddot{x}(t)-m_{x}\dot{\varpi}(t)\dot{x}(t)-\lambda_{j}(t)G^{j}_{\xi}(x(t),W(t))+L_{F}^{x}(x(t),W(t))=0;\tag{14}$$

$$-m_{W}\varpi(t)\ddot{W}(t)-m_{W}\dot{\varpi}(t)\dot{W}(t)-\lambda_{j}(t)G^{j}_{M}(x(t),W(t))+L_{F}^{W}(x(t),W(t))=0,\tag{15}$$

$$\Bigl(\frac{G^{i}_{\xi^{a}}G^{j}_{\xi^{a}}}{m_{x}}+\frac{G^{i}_{m_{ab}}G^{j}_{m_{ab}}}{m_{W}}\Bigr)\lambda_{j}=\varpi\bigl(\cdots+G^{i}_{\xi^{a}\xi^{b}}\dot{x}^{a}\dot{x}^{b}+G^{i}_{m_{ab}m_{cd}}\dot{w}_{ab}\dot{w}_{cd}\bigr)-\dot{\varpi}\bigl(\dot{x}^{a}G^{i}_{\xi^{a}}+\dot{w}_{ab}G^{i}_{m_{ab}}\bigr)+\frac{L_{F}^{x^{a}}G^{i}_{\xi^{a}}}{m_{x}}+\frac{L_{F}^{w_{ab}}G^{i}_{m_{ab}}}{m_{W}},\tag{16}$$

(In Eq. (16), $\cdots$ stands for the leading terms of the first bracket, which are missing from this copy.)

$$G^{i}_{\tau}(0,x(0),W(0))+G^{i}_{\xi^{a}}(0,x(0),W(0))\,\dot{x}^{a}(0)+G^{i}_{m_{ab}}(0,x(0),W(0))\,\dot{w}_{ab}(0)=0.$$

$$G^{i}_{\tau}(0,x(0),W(0))=0,$$

$$\begin{cases}G^{i}(0,x(0),W(0))=0,&i=1,\dots,\nu;\\ G^{i}_{\tau}(0,x(0),W(0))=0,&i=1,\dots,\nu;\\ \dot{x}(0)=0;\\ \dot{W}(0)=0.\end{cases}\tag{17}$$

$$F(t,x(t),\dot{x}(t),\ddot{x}(t),W(t),\dot{W}(t),\ddot{W}(t))=-e^{\vartheta t}V(x(t),y(t)),$$

$$V(x(t),y(t)):=\frac{1}{2}\sum_{i=1}^{\eta}\bigl(y^{i}(t)-x^{\nu-\eta+i}(t)\bigr)^{2},$$

$$\overline{E}(t):=\sum_{n=0}^{\infty}E_{n}\,\chi_{[t_{n-1},t_{n}]}(t),\qquad\overline{y}(t):=\sum_{n=0}^{\infty}y_{n}\,\chi_{[t_{n-1},t_{n}]}(t),$$

$$E(t):=(\overline{E}*R)(t),\quad\text{and}\quad y(t):=(\overline{y}*R)(t),$$

$$\int e^{\vartheta t}\Bigl(\frac{m_{W}}{2}|\dot{W}|^{2}-\overline{V}(t,W(t))\Bigr)\,dt\tag{18}$$

$$\ddot{W}(t)+\vartheta\dot{W}(t)=-\frac{1}{m_{W}}\overline{V}_{W}(t,W(t)),\tag{19}$$

Taxonomy

Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Neural dynamics and brain function

Full text

Alessandro Betti, University of Florence, Florence, Italy

Marco Gori, SAILab, University of Siena, Siena, Italy

Abstract

This paper proposes an in-depth re-thinking of neural computation that parallels apparently unrelated laws of physics, that are formulated in the variational framework of the least action principle. The theory holds for neural networks that are also based on any digraph, and the resulting computational scheme exhibits the intriguing property of being truly biologically plausible. The scheme, which is referred to as SpatioTemporal Local Propagation (STLP), is local in both space and time. Space locality comes from the expression of the network connections by an appropriate Lagrangian term, so as the corresponding computational scheme does not need the backpropagation (BP) of the error, while temporal locality is the outcome of the variational formulation of the problem. Overall, in addition to conquering the often invoked biological plausibility missed by BP, the locality in both space and time that arises from the proposed theory can neither be exhibited by Backpropagation Through Time (BPTT) nor by Real-Time Recurrent Learning (RTRL).

1 Introduction

Since the mid-eighties, the explosion of interest in neural computation has been mostly fueled by finite-dimensional optimization in the space of the connection weights. Because of the typically large number of parameters involved, gradient-descent methods have dominated the search heuristics. Moreover, it became clear early on that large-scale problems can only be faced thanks to stochastic gradient descent (SGD) and related algorithms, which represent the on-line side of classic batch-mode optimization schemes. To some extent, SGD gives learning a sort of temporal dimension. If one assumes that the examples come at discrete times, then weight updating takes place, for each example, at each temporal step. As pointed out early on in the seminal PDP book (1) (p. 324), on-line learning can be regarded as an approximation of gradient descent on the error function. In their words:

By changing the weights after each pattern is presented we depart to some extent from a true gradient descent in $E$. Nevertheless, provided the learning rate (i.e. the constant of proportionality) is sufficiently small, this departure will be negligible and the delta rule will implement a very close approximation to gradient-descent in sum-squared error.

The resulting on-line process, along with its mini-batch versions, has been the subject of recent in-depth investigations (e.g. (2)) that have helped shed light on SGD and on the many specific versions that have been massively used in machine learning.

This paper is motivated by the aim of providing a truly new foundation of learning, where “time” is regarded as an intrinsic variable for the acquisition of concepts. Basically, we propose a formulation of learning by differential equations instead of the dominant approach based on finite-dimensional optimization. This can be traced back to a number of relevant contributions, including (3) and (4), as well as to a recent interesting continuous formulation of deep learning (5). While following this track, this paper proposes a new view of learning that can be regarded as the outcome of laws of nature. We use the unifying view that arises from physics when using variational calculus and, particularly, when deriving laws of nature as stationary points of the action. We establish a parallel with mechanics according to which particle positions are associated with the neural parameters (weights and outputs), so that the velocity indicates the rate of the learning process (see Table 1). The kinetic energy has a related meaning, while the potential energy indicates the degree of satisfaction of the environmental constraints; for instance, in the case of supervised learning the potential turns out to be a loss function. Interestingly, the presence of motion constraints on particles has a counterpart in the constraints that express the neural model, so that “learning motion” is a stationary point of a functional, referred to as the cognitive action, under the neural architectural constraints. As in mechanics, the resulting solution is a system of differential equations in the Lagrangian variables that dictates the evolution of the weights and of the neural outputs. The learning behavior is reminiscent of damped oscillators, and the process of dissipation leads to ordered configurations which correspond to the outcome of learning. As dissipation increases, the proposed theory leads to solutions that approach classic gradient descent.

The most striking results of the theory are deeply rooted in the variational formulation under the subsidiary conditions that represent the neural constraints. It is shown that STLP exhibits locality in both space and time for neural networks defined by any digraph. Temporal locality is basically the outcome of the variational optimization, which yields models based on differential equations. Interestingly, space locality turns out to be the outcome of imposing the stationarity of the cognitive action under neural constraints. The issue of biological plausibility has recently been the subject of a related investigation in (6).

The message that emerges from the paper is that, in order to gain true biological plausibility, temporal locality and strong space locality must be supported. On the other hand, classic algorithms for gradient computation in recurrent neural networks do not exhibit this property: neither BPTT (Backpropagation Through Time) nor Real-Time Recurrent Learning (RTRL) possesses both space and temporal locality. BPTT is local in space, but not in time, whereas RTRL is local in time, but not in space (7). Moreover, in these classic algorithms, space locality refers to the property gained by the backpropagation factorization, not to the strong space locality that is gained by STLP. Overall, the proposed theory stimulates a re-thinking of neural computation driven by laws of nature, where there is no distinction between learning and test, and where weight updating is paired with the computation of the output in the learning environment. The theory also opens the doors to an in-depth reformulation of learning algorithms.

2 Modified Dirichlet problem

Let $\Omega$ be an open, bounded domain in ${\bf R}^{n}$, let $u\colon\Omega\to{\bf R}^{N}$, $\varpi\in{\cal C}^{1}(\Omega;(0,+\infty))$ and $\mathscr{F}(u):=\int_{\Omega}F(x,u,\nabla u)\,dx$ be given, and define the following functional

$$\mathscr{S}(u):=\frac{1}{2}\int_{\Omega}|\nabla u(x)|^{2}\,\varpi(x)\,dx+\mathscr{F}(u),\tag{1}$$

which is a weighted Dirichlet integral plus the $\mathscr{F}(u)$ term. We are here interested in the necessary conditions for $u$ to be an extremizer of the modified Dirichlet functional (1) subject to a class of holonomic constraints of the form $G(u(x))=0$. (In this section, for the sake of simplicity, we consider holonomic constraints that do not depend explicitly on the independent variable $x$; however, the arguments presented here can be readily generalized to include the case $G(x,u(x))=0$. Indeed, in Section 3 we will consider general holonomic constraints.) Let us consider the problem in Eq. (1) where $u$ is subject to the constraints $G(u(x))=0$ for all $x\in\overline{\Omega}$, with $G(z)$ of class ${\cal C}^{2}({\bf R}^{N},{\bf R}^{r})$. Furthermore, assume that the $r\times N$ Jacobian matrix defined by $G_{z}=(G^{i}_{z^{j}})$ satisfies

$$\operatorname{rank}G_{z}=r\quad\text{for all }z\in{\bf R}^{N}.\tag{2}$$

(We could ask for a less restrictive condition here, namely that $G_{z}(z)$ be full rank at all points $z\in{\bf R}^{N}$ such that $G(z)=0$.)

From the theory of the calculus of variations with subsidiary conditions (see (8), Chap. 2) we know that there exist $\lambda_{1},\dots,\lambda_{r}\in{\cal C}^{0}(\Omega)$ such that the constrained stationary points of $\mathscr{S}$ coincide with the unconstrained stationary points of the extended functional

$$\mathscr{S}^{*}(u)=\int_{\Omega}\Bigl(\frac{1}{2}|\nabla u(x)|^{2}\,\varpi(x)-\lambda_{j}(x)G^{j}(u(x))\Bigr)\,dx+\mathscr{F}(u),\tag{3}$$

and the Euler equations for this functional are (throughout this paper we adopt the Einstein summation convention: repeated indices imply summation unless otherwise specified)

$$-\varpi\Delta u-u_{x^{\alpha}}\varpi_{x^{\alpha}}-\lambda_{\ell}G_{z}^{\ell}(u)+L_{F}(u)=0,\tag{4}$$

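where $L_{F}(u)$ is the Euler operator ((8), p. 18 and (9)). Differentiating the constraints twice with respect to $x^{\alpha}$ one obtains

$$-\Delta u\cdot G_{z}^{j}(u)=G^{j}_{z^{i}z^{k}}(u)\,u^{i}_{x^{\alpha}}u^{k}_{x^{\alpha}},\qquad 1\leq j\leq r.\tag{5}$$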

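Hence, taking the scalar product of the Euler equation (4) with $G_{z}^{j}(u)$ and using (5), we obtain

$$A_{j\ell}(u)\lambda_{\ell}=\varpi\,G^{j}_{z^{i}z^{k}}(u)\,u^{i}_{x^{\alpha}}u^{k}_{x^{\alpha}}-\varpi_{x^{\alpha}}\bigl(u_{x^{\alpha}}\cdot G_{z}^{j}(u)\bigr)+L_{F}(u)\cdot G_{z}^{j}(u),\tag{6}$$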

where we defined $A_{j\ell}(u):=G_{z}^{j}(u)\cdot G^{\ell}_{z}(u)$ . Therefore Euler equations for the constrained functional (3) are (4) with

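$$\lambda_{\ell}=(A^{-1}(u))_{\ell j}\bigl(\varpi\,G^{j}_{z^{i}z^{k}}(u)\,u^{i}_{x^{\alpha}}u^{k}_{x^{\alpha}}-\varpi_{x^{\alpha}}(u_{x^{\alpha}}\cdot G^{j}_{z}(u))+L_{F}(u)\cdot G_{z}^{j}(u)\bigr).\tag{7}$$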

Notice that in order to get from Eq. (6) to (7) we need to know that $A$ is invertible. Whenever assumption (2) holds, the Gram matrix $A(u)$ is invertible in view of the following (well-known) lemma:

Lemma 1.

If $v_{1},\dots,v_{n}$ are $n$ linearly independent vectors, then the Gram matrix $G_{ij}:=(v_{i},v_{j})$ is positive definite.

Proof.

$(x,Gx)=x_{i}(v_{i},v_{j})x_{j}=(x_{i}v_{i},x_{j}v_{j})=\|v_{i}x_{i}\|^{2}\geq 0$. However, $\|v_{i}x_{i}\|=0$ if and only if $x_{i}v_{i}=0$, which by linear independence of the $v_{i}$ happens only for $x=0$; therefore we can conclude that $(x,Gx)>0$ for every $x\neq 0$. ∎
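Lemma 1 can be checked numerically. The snippet below is a minimal sketch using NumPy, with randomly drawn vectors as a purely illustrative choice: it builds the Gram matrix of three linearly independent vectors, confirms that all eigenvalues are positive, and contrasts this with a linearly dependent family, whose Gram matrix is singular.

```python
import numpy as np

def gram_matrix(vectors):
    """Gram matrix G_ij = (v_i, v_j) for the rows of `vectors`."""
    V = np.asarray(vectors, dtype=float)
    return V @ V.T

rng = np.random.default_rng(0)
# Three random vectors in R^5 (illustrative data); we confirm
# linear independence through the rank before using them.
vectors = rng.standard_normal((3, 5))
assert np.linalg.matrix_rank(vectors) == 3

G = gram_matrix(vectors)
eigenvalues = np.linalg.eigvalsh(G)          # G is symmetric
is_positive_definite = bool(np.all(eigenvalues > 0))

# A dependent family (third vector = sum of the first two)
# yields a singular Gram matrix instead.
dependent = np.vstack([vectors[0], vectors[1], vectors[0] + vectors[1]])
smallest_dep = np.linalg.eigvalsh(gram_matrix(dependent))[0]
```

In the dependent case the smallest eigenvalue vanishes up to round-off, which is exactly the situation excluded by the rank assumption (2).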

3 Neural network constraints

The typical learning paradigm within the framework of NN consists of a model that depends on a set of parameters $W$, together with an update rule for the parameters; this rule is usually gradient descent on a function that measures the goodness of the model on a specific learning task. However, in this section we will show that when the dynamics of the parameters $W$ is described by laws that come from the stationarity conditions of a functional, as happens for canonical coordinates in classical mechanics (see Table 1), the NN model can be treated using the theory of constraints described in the previous section. As an immediate consequence of this approach, the learning process gains temporal and spatial locality even in the case of recurrent NN.

First of all, let us describe the architecture of the models that we will address. Given a simple digraph $D=(V,A)$ of order $\nu$, without loss of generality we can assume $V=\{1,2,\dots,\nu\}$ and $A=\{(i,j)\in{\bf N}^{2}\mid i\in V,j\in V\}$. A neural network constructed on $D$ consists of a set of maps $i\in V\mapsto x^{i}\in{\bf R}$ and $(i,j)\in A\mapsto w_{ij}\in{\bf R}$ together with $\nu$ constraints $G^{j}(x,W)=0$, $j=1,2,\dots,\nu$, where $(W)_{ij}=w_{ij}$. (Please notice that $x$ is now a variable of the variational problem, and therefore represents a mapping $t\mapsto x(t)$; it is not to be read as the independent variable of the problem described in the previous section.) Let ${\cal M}_{\nu}({\bf R})$ be the set of all $\nu\times\nu$ real matrices and ${\cal M}^{\downarrow}_{\nu}({\bf R})$ the set of all $\nu\times\nu$ strictly lower triangular matrices over ${\bf R}$. If $W\in{\cal M}^{\downarrow}_{\nu}({\bf R})$ we say that the NN has a feedforward structure. In this paper we will consider both feedforward NN and NN with cycles. The relations $G^{j}=0$ for $j=1,\dots,\nu$ specify the computational scheme with which the information diffuses through the network. In a typical network with $\omega$ inputs these constraints are defined as follows (see also Fig. 1): for any vector $\xi\in{\bf R}^{\nu}$, for any matrix $M\in{\cal M}_{\nu}({\bf R})$ with entries $m_{ij}$ and for any given ${\cal C}^{1}$ map $e\colon(0,+\infty)\to{\bf R}^{\omega}$ we define the constraint on neuron $j$ when the example $e(\tau)$ is presented to the network as

$$G^{j}(\tau,\xi,M):=\begin{cases}\xi^{j}-e^{j}(\tau),&\text{if }1\leq j\leq\omega;\\ \xi^{j}-\sigma(m_{jk}\xi^{k}),&\text{if }\omega<j\leq\nu,\end{cases}\tag{8}$$

where $\sigma\colon{\bf R}\to{\bf R}$ is of class ${\cal C}^{2}({\bf R})$ .
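To make the neural constraints concrete, the sketch below evaluates the residuals $G^{j}$ for a small network: input neurons are clamped to the signal $e(\tau)$, and every other neuron must equal $\sigma$ applied to its weighted inputs. The network size, the choice $\sigma=\tanh$, the input signal and the helper names `constraints` and `forward_pass` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def constraints(tau, xi, M, e, omega, sigma=np.tanh):
    """Residuals G^j(tau, xi, M) for all neurons (0-based indices)."""
    xi = np.asarray(xi, dtype=float)
    G = xi - sigma(M @ xi)            # non-input neurons: xi^j - sigma(m_jk xi^k)
    G[:omega] = xi[:omega] - e(tau)   # input neurons: xi^j - e^j(tau)
    return G

def forward_pass(tau, M, e, omega, sigma=np.tanh):
    """Neuron outputs that satisfy G = 0 when M is strictly lower triangular."""
    nu = M.shape[0]
    x = np.zeros(nu)
    x[:omega] = e(tau)
    for j in range(omega, nu):        # each neuron only reads earlier neurons
        x[j] = sigma(M[j, :j] @ x[:j])
    return x

rng = np.random.default_rng(0)
nu, omega = 5, 2                                       # hypothetical sizes
M = np.tril(rng.standard_normal((nu, nu)), k=-1)       # feedforward weights
e = lambda tau: np.array([np.sin(tau), np.cos(tau)])   # hypothetical input signal

x = forward_pass(0.3, M, e, omega)
G_ok = constraints(0.3, x, M, e, omega)          # residuals on the forward pass
G_bad = constraints(0.3, x + 0.1, M, e, omega)   # residuals off the constraint set
```

For a strictly lower triangular weight matrix the usual forward pass is precisely the solution of $G=0$, so `G_ok` vanishes, while a perturbed state gives nonzero residuals.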

Our goal here is to show that such relations, that normally are considered just a local description of the compositional structure of the NN, once properly interpreted as constraints in the space $x-W$ (see Fig. 1) are suitable holonomic subsidiary conditions in the sense of (2).

Like in the case of classical mechanics, when dealing with learning processes we are interested in the temporal dynamics of the variables when they are exposed to the data from which the learning is supposed to happen. For this reason in this section we can restrict ourselves to the case $n=1$ and regard this variable as time ( $x^{1}=t$ ). Moreover because the neural constraints $G^{j}(x,W)=0$ involve not only $W$ but also $x$ the $N$ variables $u_{1},\dots,u_{N}$ split into $x\in{\bf R}^{\nu}$ and $W\in{\cal M}_{\nu}({\bf R})$ .

Feedforward Networks. Now let us consider the case $W\in{\cal M}^{\downarrow}_{\nu}({\bf R})$ and let us extend the theory described in Eq. (1) by allowing $\mathscr{F}(x,W):=\int F(t,x,\dot{x},\ddot{x},W,\dot{W},\ddot{W})\,dt$ , so that, in the end, we consider the functional

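$$\mathscr{A}(x,W):=\int\frac{1}{2}\bigl(m_{x}|\dot{x}(t)|^{2}+m_{W}|\dot{W}(t)|^{2}\bigr)\varpi(t)\,dt+\mathscr{F}(x,W),\tag{9}$$

subject to the constraints

$$G^{j}(t,x(t),W(t))=0,\qquad 1\leq j\leq\nu.\tag{10}$$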

Then the following proposition holds true:

Proposition 1.

The matrix $({G_{\xi}\atop\overline{G_{M}}})\in{\cal M}_{(\nu^{2}+\nu)\times\nu}({\bf R})$ is full rank.

Proof.

First of all notice that if $(G_{\xi})_{ij}=G^{j}_{\xi^{i}}$ is full rank also $({G_{\xi}\atop\overline{G_{M}}})$ has this property. Then, since

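$$G^{j}_{\xi^{i}}(\tau,\xi,M)=\begin{cases}\delta_{ij},&\text{if }1\leq j\leq\omega;\\ \delta_{ij}-\sigma'(m_{jk}\xi^{k})\,m_{ji},&\text{if }\omega<j\leq\nu,\end{cases}$$

we immediately notice that $G^{i}_{\xi^{i}}=1$ and that for all $i>j$ we have $G^{j}_{\xi^{i}}=0$, since $m_{ji}=0$ whenever $i\geq j$ for a strictly lower triangular matrix. This means that

$$\bigl(G^{j}_{\xi^{i}}(\tau,\xi,M)\bigr)=\begin{pmatrix}1&*&\cdots&*\\ 0&1&\cdots&*\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&1\end{pmatrix},$$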

which is clearly full rank. ∎
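The structure used in the proof can be verified numerically. The following sketch assumes $\sigma=\tanh$ and a randomly drawn strictly lower triangular weight matrix (both illustrative choices); it assembles the matrix with entries $G^{j}_{\xi^{i}}$ and checks that it is upper triangular with unit diagonal, hence full rank.

```python
import numpy as np

def jacobian_xi(xi, M, omega):
    """T[i, j] = dG^j/dxi^i for tanh activations (0-based indices)."""
    nu = M.shape[0]
    dsigma = 1.0 - np.tanh(M @ xi) ** 2      # sigma'(m_jk xi^k), one value per neuron j
    J = np.eye(nu) - dsigma[:, None] * M     # J[j, i] = delta_ij - sigma'(.) m_ji
    J[:omega, :] = np.eye(nu)[:omega, :]     # input neurons: delta_ij
    return J.T

rng = np.random.default_rng(1)
nu, omega = 6, 2                                    # hypothetical sizes
M = np.tril(rng.standard_normal((nu, nu)), k=-1)    # strictly lower triangular
xi = rng.standard_normal(nu)

T = jacobian_xi(xi, M, omega)
is_upper_unit = np.allclose(T, np.triu(T)) and np.allclose(np.diag(T), 1.0)
rank = np.linalg.matrix_rank(T)
```

Because the diagonal entries are all equal to one, the determinant is one regardless of the weights or of the point $\xi$ at which the Jacobian is evaluated.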

Notice that this result heavily depends on the assumption $W\in{\cal M}^{\downarrow}_{\nu}({\bf R})$; however, we will now discuss how the introduction of an additional variable that models the degree of satisfaction of the neural constraints acts as a regularizer of constraints (8) and ensures the satisfaction of (2).

Recurrent networks. Let us suppose that we also assign to each neuron a variable $s$ that measures the degree of violation of the constraint. Then Eq. (10) assumes the form $G^{j}(t,x,W,s)=0$, $j=1,2,\dots,\nu$, where

$$G^{j}(\tau,\xi,M,\zeta):=\begin{cases}\xi^{j}-e^{j}(\tau)+\zeta^{j},&\text{if }1\leq j\leq\omega;\\ \xi^{j}-\sigma(m_{jk}\xi^{k})+\zeta^{j},&\text{if }\omega<j\leq\nu.\end{cases}\tag{11}$$

In doing so it is important to notice that Proposition 1 holds without the assumption that $M\in{\cal M}^{\downarrow}_{\nu}({\bf R})$ as it is immediate to prove since $G^{j}_{\zeta^{i}}=\delta_{ij}$ , which is of course full rank. This important remark opens the possibility to extend the theory to networks with “feedback” connections based on general simple digraphs.
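A two-neuron example illustrates why the slack variable matters. The sketch below assumes $\sigma=\tanh$, no input neurons, and two neurons feeding each other with unit weights (a hypothetical cyclic network chosen so that the Jacobian with respect to $\xi$ is singular at $\xi=0$). Adding the block $G^{j}_{\zeta^{i}}=\delta_{ij}$ restores full rank.

```python
import numpy as np

nu = 2
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])                   # cycle: neuron 1 <-> neuron 2
xi = np.zeros(nu)

dsigma = 1.0 - np.tanh(M @ xi) ** 2          # tanh'(0) = 1 for both neurons
G_xi = np.eye(nu) - dsigma[:, None] * M      # entry [j, i] = dG^j/dxi^i
G_zeta = np.eye(nu)                          # entry [j, i] = dG^j/dzeta^i = delta_ij

# Rank of the constraint Jacobian without and with the slack block.
# Rows index constraints here; the rank equals that of the transposed layout.
rank_without_slack = np.linalg.matrix_rank(G_xi)
rank_with_slack = np.linalg.matrix_rank(np.hstack([G_xi, G_zeta]))
```

Here `G_xi` equals $I-M$, which has rank one, whereas the augmented Jacobian has rank two for any choice of weights.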

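In this formulation of the theory the action described in Eq. (9) must be modified to take into account the introduction of the new variable $s$:

$$\mathscr{A}(x,W,s):=\int\frac{1}{2}\bigl(m_{x}|\dot{x}(t)|^{2}+m_{W}|\dot{W}(t)|^{2}+m_{s}|\dot{s}(t)|^{2}\bigr)\varpi(t)\,dt+\mathscr{F}(x,W,s),\tag{12}$$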

where $\mathscr{F}(x,W,s):=\int F(t,x,\dot{x},\ddot{x},W,\dot{W},\ddot{W},s)\,dt$ .

4 Cognitive action and laws of learning: Feedforward architecture

In the previous section we concentrated on showing that the set of constraints that define a NN are good constraints (in the sense of (2)). In this section we will focus on the feedforward case described by the functional (9) together with constraints (10). In particular, we will discuss the update rules (Euler-Lagrange equations) for the variables $x$ and $W$ derived from the stationarity conditions of the functional (9). We notice in passing that when imposing the stationarity of the action, $\delta\mathscr{A}=0$, we give rise to a computational model that, in general, is remarkably different from the classic optimization approaches used in machine learning, which are typically driven by the gradient heuristics. Basically, the models arising from $\delta\mathscr{A}=0$, instead of gradually reducing the action from its initial value, satisfy this condition at every time instant, thus resembling what happens for Newton's laws.

We begin by deriving the constrained Euler-Lagrange (EL) equations associated with the functional (9) under subsidiary conditions (10). The constrained functional is

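$$\mathscr{A}^{*}(x,W)=\int\Bigl(\frac{1}{2}\bigl(m_{x}|\dot{x}(t)|^{2}+m_{W}|\dot{W}(t)|^{2}\bigr)\varpi(t)-\lambda_{j}(t)G^{j}(t,x(t),W(t))\Bigr)\,dt+\mathscr{F}(x,W),\tag{13}$$

and its EL-equations thus read

$$-m_{x}\varpi(t)\ddot{x}(t)-m_{x}\dot{\varpi}(t)\dot{x}(t)-\lambda_{j}(t)G^{j}_{\xi}(x(t),W(t))+L_{F}^{x}(x(t),W(t))=0;\tag{14}$$

$$-m_{W}\varpi(t)\ddot{W}(t)-m_{W}\dot{\varpi}(t)\dot{W}(t)-\lambda_{j}(t)G^{j}_{M}(x(t),W(t))+L_{F}^{W}(x(t),W(t))=0,\tag{15}$$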

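where $L_{F}^{x}=F_{x}-d(F_{\dot{x}})/dt+d^{2}(F_{\ddot{x}})/dt^{2}$ and $L_{F}^{W}=F_{W}-d(F_{\dot{W}})/dt+d^{2}(F_{\ddot{W}})/dt^{2}$ are the functional derivatives of $F$ with respect to $x$ and $W$ respectively (see (9)). As explained in Section 2, an expression for the Lagrange multipliers is derived by differentiating the constraint twice with respect to time and using the resulting expression to substitute the second-order terms in the Euler equations. In this case the analogue of Eq. (6) is

$$\Bigl(\frac{G^{i}_{\xi^{a}}G^{j}_{\xi^{a}}}{m_{x}}+\frac{G^{i}_{m_{ab}}G^{j}_{m_{ab}}}{m_{W}}\Bigr)\lambda_{j}=\varpi\bigl(\cdots+G^{i}_{\xi^{a}\xi^{b}}\dot{x}^{a}\dot{x}^{b}+G^{i}_{m_{ab}m_{cd}}\dot{w}_{ab}\dot{w}_{cd}\bigr)-\dot{\varpi}\bigl(\dot{x}^{a}G^{i}_{\xi^{a}}+\dot{w}_{ab}G^{i}_{m_{ab}}\bigr)+\frac{L_{F}^{x^{a}}G^{i}_{\xi^{a}}}{m_{x}}+\frac{L_{F}^{w_{ab}}G^{i}_{m_{ab}}}{m_{W}},\tag{16}$$

where $\cdots$ stands for the leading terms of the first bracket, which are missing from this copy of the text, and $G^{i}_{\tau}$, $G^{i}_{\tau\tau}$, $G^{i}_{\xi^{a}}$, $G^{i}_{\xi^{a}\xi^{b}}$, $G^{i}_{m_{ab}}$ and $G^{i}_{m_{ab}m_{cd}}$ are the gradients and the Hessians of constraint (10).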
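As a worked step, the double differentiation mentioned above can be written out with the chain rule. This is a sketch of the calculation rather than the paper's own display: it sets $g_{i}(t):=G^{i}(t,x(t),W(t))$, and the mixed-derivative symbols $G^{i}_{\tau\xi^{a}}$, $G^{i}_{\tau m_{ab}}$ and $G^{i}_{\xi^{a}m_{bc}}$ are notation introduced here by analogy with the symbols listed in the text.

```latex
\begin{aligned}
0=\ddot{g}_{i}(t)={}&G^{i}_{\tau\tau}
 +2\,G^{i}_{\tau\xi^{a}}\dot{x}^{a}
 +2\,G^{i}_{\tau m_{ab}}\dot{w}_{ab}
 +2\,G^{i}_{\xi^{a}m_{bc}}\dot{x}^{a}\dot{w}_{bc}\\
&+G^{i}_{\xi^{a}\xi^{b}}\dot{x}^{a}\dot{x}^{b}
 +G^{i}_{m_{ab}m_{cd}}\dot{w}_{ab}\dot{w}_{cd}
 +G^{i}_{\xi^{a}}\ddot{x}^{a}
 +G^{i}_{m_{ab}}\ddot{w}_{ab}.
\end{aligned}
```

Replacing $\ddot{x}^{a}$ and $\ddot{w}_{ab}$ by their expressions from the Euler-Lagrange equations then leaves a linear system in the multipliers $\lambda_{j}$, whose matrix is the weighted Gram matrix $G^{i}_{\xi^{a}}G^{j}_{\xi^{a}}/m_{x}+G^{i}_{m_{ab}}G^{j}_{m_{ab}}/m_{W}$.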

Initial conditions. Suppose now that we want to solve Eq. (14)–(15) with Cauchy initial conditions. Of course we must choose $W(0)$ and $x(0)$ such that $g_{i}(0)=0$, where we set $g_{i}(t):=G^{i}(t,x(t),W(t))$, for $i=1,\dots,\nu$. However, since the constraint must also hold for all $t\geq 0$, we must also have at least $g^{\prime}_{i}(0)=0$. Written explicitly, these conditions mean

$$G^{i}_{\tau}(0,x(0),W(0))+G^{i}_{\xi^{a}}(0,x(0),W(0))\,\dot{x}^{a}(0)+G^{i}_{m_{ab}}(0,x(0),W(0))\,\dot{w}_{ab}(0)=0.$$

If the constraints do not depend explicitly on time, it is sufficient to choose $\dot{x}(0)=0$ and $\dot{W}(0)=0$, while for time-dependent constraints this choice leaves

$$G^{i}_{\tau}(0,x(0),W(0))=0,$$

which is an additional constraint on the initial conditions $x(0)$ and $W(0)$ to be satisfied. Therefore one possible consistent way to impose Cauchy conditions is

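$$\begin{cases}G^{i}(0,x(0),W(0))=0,&i=1,\dots,\nu;\\ G^{i}_{\tau}(0,x(0),W(0))=0,&i=1,\dots,\nu;\\ \dot{x}(0)=0;\\ \dot{W}(0)=0.\end{cases}\tag{17}$$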

The conditions on the higher derivatives of $g_{i}$ at $t=0$ are then automatically satisfied through the differential equations.

Supervised Learning and reduction to BP. In order to see how this theory can be readily applied to learning let us restrict ourselves to the case $W\in{\cal M}^{\downarrow}_{\nu}({\bf R})$ and choose $\varpi(t)=\exp(\vartheta t)$ , $\vartheta>0$ , $m>0$ . Now let us choose

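$$F(t,x(t),\dot{x}(t),\ddot{x}(t),W(t),\dot{W}(t),\ddot{W}(t))=-e^{\vartheta t}V(x(t),y(t)),$$

where $y(t)$ is an assigned supervision signal and

$$V(x(t),y(t)):=\frac{1}{2}\sum_{i=1}^{\eta}\bigl(y^{i}(t)-x^{\nu-\eta+i}(t)\bigr)^{2},$$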

$x^{\nu-\eta+1},\dots,x^{\nu}$ being the variables associated with the output neurons. A typical input signal and the corresponding supervision signal can be constructed from a standard training set $\mathscr{L}:=\{(e_{\kappa},d_{\kappa})\mid e_{\kappa}\in{\bf R}^{\omega},d_{\kappa}\in{\bf R}^{\eta},\kappa=1,\dots,\ell\}$ in the following manner. Choose a sequence of times $\langle t_{n}\rangle:=t_{0},t_{1},t_{2},\dots$ such that $|t_{i+1}-t_{i}|=:\tau$ is constant for all $i\in{\bf N}$. Furthermore, define the following sequences: $\langle E_{n}\rangle:=e_{1},\dots,e_{\ell},e_{1},\dots,e_{\ell},\dots$ and $\langle y_{n}\rangle:=d_{1},\dots,d_{\ell},d_{1},\dots,d_{\ell},\dots$. Let $R(t):=\sum_{n=0}^{\infty}\rho_{\epsilon}(t-t_{n})$, where $\rho_{\epsilon}(\cdot)$ are standard Friedrichs mollifiers, and define

$$\overline{E}(t):=\sum_{n=0}^{\infty}E_{n}\,\chi_{[t_{n-1},t_{n}]}(t),\qquad\overline{y}(t):=\sum_{n=0}^{\infty}y_{n}\,\chi_{[t_{n-1},t_{n}]}(t),$$

where $\chi_{A}$ is the characteristic function of the set $A$ and $t_{-1}=0$. Then the signals

$$E(t):=(\overline{E}*R)(t),\quad\text{and}\quad y(t):=(\overline{y}*R)(t),$$

are piecewise constant signals with smooth transitions. The temporal behaviour of these signals is depicted in the side figure.

To understand the behaviour of the Euler equations (14) and (15) we observe that in the case of feedforward networks, as it is well known, the constraints $G^{j}(t,x,W)=0$ can be solved for $x$ so that eventually we can express the value of the output neurons in terms of the value of the input neurons. If we let $f^{i}_{W}(e(t))$ be the value of $x^{\nu-i}$ when $x^{1}=e^{1}(t),\dots,x^{\omega}=e^{\omega}(t)$ , then the theory defined by (9) under subsidiary conditions (10) is equivalent, when $m_{x}=0$ , to the unconstrained theory defined by

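$$\int e^{\vartheta t}\Bigl(\frac{m_{W}}{2}|\dot{W}|^{2}-\overline{V}(t,W(t))\Bigr)\,dt\tag{18}$$

where $\overline{V}(t,W(t)):=\frac{1}{2}\sum_{i=1}^{\eta}(y^{i}(t)-f^{i}_{W}(E(t)))^{2}$. The Euler equations associated with (18) are

$$\ddot{W}(t)+\vartheta\dot{W}(t)=-\frac{1}{m_{W}}\overline{V}_{W}(t,W(t)),\tag{19}$$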

which in the limit $\vartheta\to\infty$ and $\vartheta m_{W}\to\gamma$ reduce to the gradient method

$$\dot{W}(t)=-\frac{1}{\gamma}\overline{V}_{W}(t,W(t)),\tag{20}$$

with learning rate $1/\gamma$. Notice that the presence of the term $\varpi(t)$ that we proposed in the general theory is essential in order to obtain a learning behaviour, as it produces dissipation.
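The limit described above can be illustrated on a one-weight toy problem. The sketch below is not the paper's simulation: it assumes a hypothetical quadratic potential $\overline{V}(w)=\frac{1}{2}(y-we)^{2}$ with constant $e$ and $y$, integrates the damped second-order equation with a simple Euler scheme for decreasing mass $m_{W}$ while keeping $\vartheta m_{W}=\gamma$ fixed, and compares the result with the gradient flow of rate $1/\gamma$.

```python
def heavy_ball(grad, w0, m, theta, t_end, dt):
    """Integrate w'' + theta * w' = -grad(w) / m with w'(0) = 0 (semi-implicit Euler)."""
    w, v = w0, 0.0
    for _ in range(round(t_end / dt)):
        v += dt * (-theta * v - grad(w) / m)
        w += dt * v
    return w

def gradient_flow(grad, w0, gamma, t_end, dt):
    """Integrate w' = -grad(w) / gamma (explicit Euler)."""
    w = w0
    for _ in range(round(t_end / dt)):
        w += dt * (-grad(w) / gamma)
    return w

# Hypothetical one-weight problem: V(w) = 0.5 * (y - w * e)^2, constant e and y.
e, y = 1.0, 3.0
grad = lambda w: -(y - w * e) * e

gamma, t_end, dt = 1.0, 2.0, 1e-4
w_flow = gradient_flow(grad, 0.0, gamma, t_end, dt)

gaps = []
for m in (0.1, 0.01):
    theta = gamma / m                 # keeps theta * m = gamma fixed
    w_hb = heavy_ball(grad, 0.0, m, theta, t_end, dt)
    gaps.append(abs(w_hb - w_flow))   # distance from the gradient-flow solution
```

The gap between the two trajectories shrinks as the mass decreases, consistent with the statement that strong dissipation recovers gradient descent.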

Typically the term $\overline{V}_{W}(t,W(t))$ in Eq. (20) can be evaluated using the Backpropagation algorithm; we will now show that Eq. (14)–(16), in the same limit used above, $m_{x}\to 0$, $m_{W}\to 0$, $m_{x}/m_{W}\to 0$, reproduce Eq. (20), where the term $\overline{V}_{W}(t,W(t))$ explicitly assumes the form prescribed by BP. In order to see this, choose $\vartheta=\gamma/m_{W}$ and multiply both sides of Eq. (14)–(16) by $\exp(-\vartheta t)$, then take the limit $m_{x}\to 0$, $m_{W}\to 0$, $m_{x}/m_{W}\to 0$. In this limit Eq. (15) and Eq. (16) become respectively

[TABLE]

where $\delta_{j}$ is the limit of $\exp(-\vartheta t)\lambda_{j}$. The matrix $G^{i}_{\xi^{a}}G^{j}_{\xi^{a}}$ is not only invertible but is also a Gram matrix: if we define $T_{ij}:=G^{j}_{\xi^{i}}$, then $G^{i}_{\xi^{a}}G^{j}_{\xi^{a}}=(T^{\prime}T)_{ij}$. If we then set $v_{i}:=-\gamma G^{i}_{\xi^{a}}\dot{x}^{a}$, the $\delta$'s satisfy $T^{\prime}T\delta=v$, where $T$ is an upper triangular matrix. Solving this equation is equivalent to solving $T^{\prime}\mu=v$ and then $T\delta=\mu$. From this last equation we immediately see that, once $\mu$ is known, $\delta$ is derived recursively starting from the output neurons. Finally, we can interpret $\delta_{i}$ in Eq. (22) as the delta-error, which is recursively determined by Eq. (21) because of the special structure of the matrix $G^{i}_{\xi^{a}}G^{j}_{\xi^{a}}$.
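The two-stage solution of $T^{\prime}T\delta=v$ can be sketched as a forward substitution followed by a backward substitution. The code below uses a randomly generated unit upper triangular $T$ and a random right-hand side as stand-ins for the actual Jacobian and vector $v$; the function name `solve_gram_triangular` is hypothetical.

```python
import numpy as np

def solve_gram_triangular(T, v):
    """Solve (T'T) delta = v for upper triangular T via two substitutions."""
    n = len(v)
    mu = np.zeros(n)
    for i in range(n):                 # forward: T' mu = v (T' is lower triangular)
        mu[i] = (v[i] - T[:i, i] @ mu[:i]) / T[i, i]
    delta = np.zeros(n)
    for i in reversed(range(n)):       # backward: T delta = mu, last neuron first
        delta[i] = (mu[i] - T[i, i + 1:] @ delta[i + 1:]) / T[i, i]
    return delta

rng = np.random.default_rng(2)
n = 4
T = np.triu(rng.standard_normal((n, n)), k=1) + np.eye(n)   # stand-in for G_xi
v = rng.standard_normal(n)                                   # stand-in right-hand side

delta = solve_gram_triangular(T, v)
delta_ref = np.linalg.solve(T.T @ T, v)                      # dense reference solution
```

The backward loop starts from the highest index, which for a feedforward network corresponds to the output neurons, mirroring the recursive determination of the delta-error described in the text. Each substitution costs a number of operations quadratic in the size of $T$.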

Optimal inversion of $A=G^{i}_{\xi^{a}}G^{j}_{\xi^{a}}$. Since $A$ is a Gram matrix, its inversion, which is required for determining $\delta_{j}$ in Eq. (16), can be carried out with optimal complexity (10). Basically, we only need a number of dominant floating-point operations that grows quadratically with the dimension of $A$.

4.1 Simulation of the dynamics

In order to check the soundness of the proposed theory we performed some simulations of the Euler equations (14) and (15) in the special case $\omega=1$, $\eta=1$, $\varpi=\exp(\vartheta t)$ and $F=-\exp(\vartheta t)\overline{V}(t,x(t))$, where $\overline{V}(t,x(t))$ is taken to be a quadratic loss on the output neuron. To understand the learning dynamics of the weights we choose a constant supervision signal and various time-dependent input signals $e(t)$. Figure 2 shows the evolution of the weight of a single linear neuron $x(t)=w(t)e(t)$ with a target $y=3$ and a variable input $e(t)$. In Fig. 2–(a) $e(t)\to 3$ as $t\to\infty$, and indeed $w(t)$ converges to $1$. In Fig. 2–(b) $e(t)\approx 3(1-t)$ and consistently $w(t)\approx 1/(1-t)$. Notice that in both cases the neuron constraint is always exactly satisfied. Recall that the initial conditions must be consistent with Eq. (17); in the example of Fig. 2–(a) we have $w(0)=0$, which guarantees $G_{\tau}=0$, while in the experiment of Fig. 2–(b) one can choose $\dot{w}(0)\neq 0$, as the condition $G_{\tau}=0$ is ensured by $\dot{e}(0)=0$.

In Fig. 3, instead, we tested the robustness of the method with respect to numerical errors by running the simulation for a longer period of time. The model here consists of a two-neuron NN with a nonlinear activation function. We observed that, due to numerical errors, the system can fail to converge to the correct solution $w=1$ (Fig. 3–(a)). This can be understood as soon as we realize that, following the ideas of Section 2, the EL equations enforce only the vanishing of the second derivative of the constraints; therefore errors on the trajectories can shift the dynamics of the system onto another constraint that differs from the correct one by a linear function of time. We found that such behaviour can be effectively corrected (see Fig. 3–(b)) by adding to the potential a quadratic loss on the constraint itself.

5 Conclusions

This paper proposes a novel formulation of learning by differential equations instead of the dominant approach based on finite-dimensional optimization. This can be traced back to many contributions that appeared at the end of the eighties (see e.g. (3) and (4)), as well as to the recent trend of emphasizing continuous computational models of learning (see e.g. (5); footnote: best student paper award at NeurIPS 2018). The distinctive view proposed in this paper consists of the close parallel with mechanics, which arises from the general principle of formulating variational laws of nature. The STLP computational scheme possesses the distinctive feature of being local in both space and time. Moreover, the space locality property gained in this way goes beyond the classic local neural communication required for computing the gradient. Unlike BP, there is no need to synchronize the forward and backward steps that return the factors of the gradient, since they are locally available. The theory nicely addresses classic arguments on the biological plausibility of BP (11), and opens the doors to an in-depth reformulation of learning algorithms for both feedforward and recurrent neural networks.

Bibliography

(1) D.E. Rumelhart, J.L. McClelland, and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1. MIT Press, Cambridge, 1986.
(2) Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
(3) F.J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59:2229–2232, 1987.
(4) B.A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1:263–269, 1989.
(5) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, NeurIPS, pages 6572–6583, 2018.
(6) Yoshua Bengio, Dong-Hyun Lee, Jörg Bornschein, and Zhouhan Lin. Towards biologically plausible deep learning. CoRR, abs/1502.04156, 2015.
(7) R.J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270–280, 1989.
(8) Mariano Giaquinta and Stefan Hildebrandt. Calculus of Variations, vol. I. Number 310 in A Series of Comprehensive Studies in Mathematics, 1996.
