Multi-layered Spiking Neural Network with Target Timestamp Threshold Adaptation and STDP

Pierre Falez; Pierre Tirilly; Ioan Marius Bilasco; Philippe Devienne; Pierre Boulet

arXiv:1904.01908·cs.CV·April 4, 2019

TL;DR

This paper introduces a new threshold adaptation method for multi-layered spiking neural networks that improves classification accuracy and explores network sparsity, advancing energy-efficient neural computation.

Contribution

The paper presents a novel timestamp-based threshold adaptation system for multi-layered SNNs, achieving state-of-the-art classification accuracy with unsupervised learning.

Findings

01

Achieved 98.60% accuracy on MNIST with unsupervised SNN and SVM.

02

Achieved 99.46% accuracy on Faces/Motorbikes dataset.

03

Analyzed the impact of inhibition policies and STDP rules on network sparsity.

Abstract

Spiking neural networks (SNNs) are good candidates to produce ultra-energy-efficient hardware. However, the performance of these models is currently behind traditional methods. Introducing multi-layered SNNs is a promising way to reduce this gap. We propose in this paper a new threshold adaptation system which uses a timestamp objective at which neurons should fire. We show that our method leads to state-of-the-art classification rates on the MNIST dataset (98.60%) and the Faces/Motorbikes dataset (99.46%) with an unsupervised SNN followed by a linear SVM. We also investigate the sparsity level of the network by testing different inhibition policies and STDP rules.

Tables (8)

Table 1. TABLE I: Default SNN parameters used in the experiments. $\mathcal{N}(\mu,\sigma)$ is a normal distribution centered in $\mu$ and with variance $\sigma$. $\mathcal{U}(a,b)$ is a uniform distribution in $[a,b]$.

Learning
$\lambda$	0.95	$N_{\sf epoch}$	100
STDP
$W_{\sf min}$	0.0	$W_{\sf max}$	1.0	$\eta_{W}(0)$	0.1
$\beta$	1.0	$\tau$	0.1	$W(0)$	$\sim\mathcal{U}(0,1)$
Neural Coding
$T_{\sf start}$	0.0	$T_{\sf end}$	1.0
Threshold Adaptation
$t_{\sf target}$	0.7	$\eta_{\sf Th}(0)$	1.0	$\mathrm{Th}_{\sf min}$	1.0
$V_{\sf Th}(0)$	$\sim\mathcal{N}(5,1)$	$V_{\sf inh}$	1.0
Pre-processing
$\mathrm{DoG}_{\sf center}$	1.0	$\mathrm{DoG}_{\sf surround}$	4.0	$\mathrm{DoG}_{\sf size}$	7

Table 2. TABLE II: Architecture used with the MNIST dataset.

Type	Filter size	Filter number	Stride
Convolution	$5 \times 5$	32	1
Pooling	$2 \times 2$	32	2
Convolution	$5 \times 5$	128	1
Pooling	$2 \times 2$	128	2
Fully-connected	$4 \times 4$	4096	1

Table 3. TABLE III: Results on MNIST with different $t_{\sf target}$ variations. $\Delta_{t}$ is the difference of $t_{\sf target}$ between consecutive layers. $t_{\sf target}$ of the first layer is fixed to $0.75$.

$Δ_{t}$	Recognition rate	Sparsity
-0.20	11.35 $\pm$ 0.00	0.0000 $\pm$ 0.0000
-0.10	85.56 $\pm$ 2.28	0.5129 $\pm$ 0.0230
-0.05	97.68 $\pm$ 0.14	0.2855 $\pm$ 0.0067
-0.01	98.36 $\pm$ 0.05	0.1568 $\pm$ 0.0068
0.0	98.47 $\pm$ 0.07	0.1365 $\pm$ 0.0052
+0.01	98.54 $\pm$ 0.10	0.1209 $\pm$ 0.0066
+0.05	98.43 $\pm$ 0.10	0.0754 $\pm$ 0.0082
+0.10	97.24 $\pm$ 0.24	0.0176 $\pm$ 0.0010
+0.20	92.43 $\pm$ 1.70	0.0004 $\pm$ 0.0016

Table 4. TABLE IV: Recognition rates on MNIST with the different inhibition policies for $t_{\sf target}=0.75$ and biological STDP ($\tau=0.1$).

Inhibition policy	Layer	Recognition rate	Sparsity
Winner-take-all	Conv1	84.28 $\pm$ 0.98	0.3389 $\pm$ 0.0148
	Conv2	89.07 $\pm$ 0.74	0.6509 $\pm$ 0.0026
	FC	61.82 $\pm$ 1.92	1.0000 $\pm$ 0.0000
Soft inhibition	Conv1	85.47 $\pm$ 0.99	0.2806 $\pm$ 0.0443
	Conv2	96.14 $\pm$ 0.68	0.3984 $\pm$ 0.0171
	FC	94.86 $\pm$ 0.17	0.8965 $\pm$ 0.0031
No inhibition	Conv1	84.71 $\pm$ 1.04	0.1538 $\pm$ 0.0069
	Conv2	96.15 $\pm$ 0.17	0.1621 $\pm$ 0.0056
	FC	98.47 $\pm$ 0.07	0.1365 $\pm$ 0.0052

Table 5. TABLE V: Recognition rates on MNIST w.r.t. STDP rules ($t_{\sf target}=0.75$).

STDP rule	Recognition rate	Sparsity
Additive STDP	96.10 $\pm$ 0.33	0.8057 $\pm$ 0.0127
Multiplicative STDP ( $β = 2.0$ )	97.99 $\pm$ 0.10	0.6298 $\pm$ 0.0052
Multiplicative STDP ( $β = 3.0$ )	98.22 $\pm$ 0.06	0.3215 $\pm$ 0.0154
Multiplicative STDP ( $β = 4.0$ )	97.67 $\pm$ 0.11	0.1203 $\pm$ 0.0044
Biological STDP ( $τ = 0.05$ )	98.04 $\pm$ 0.14	0.0622 $\pm$ 0.0072
Biological STDP ( $τ = 0.1$ )	98.47 $\pm$ 0.07	0.1335 $\pm$ 0.0066
Biological STDP ( $τ = 0.5$ )	98.16 $\pm$ 0.13	0.2220 $\pm$ 0.0096

Table 6. TABLE VI: Recognition rates of multi-$t_{\sf target}$ SNNs on MNIST. Each configuration has a total of 4096 output neurons.

$N$	$t_{\sf target}$	Rec. rate
4096	0.750	98.47 $\pm$ 0.07
2048	0.300, 0.750	98.51 $\pm$ 0.06
2048	0.650, 0.750	98.53 $\pm$ 0.06
1024	0.300, 0.500, 0.700, 0.800	98.59 $\pm$ 0.06
1024	0.650, 0.700, 0.750, 0.800	98.60 $\pm$ 0.08
512	0.200, 0.300, 0.400, 0.500, 0.600, 0.700, 0.800, 0.900	98.48 $\pm$ 0.05
512	0.675, 0.700, 0.725, 0.750, 0.775, 0.800, 0.825, 0.850	98.57 $\pm$ 0.08

Table 7. TABLE VII: Comparison with different spiking models with STDP from the literature (MNIST).

Model	Description	Recognition rate
Querlioz et al. 2011[26]	Single layer SNN	93.50
Diehl et al. 2015[11]	Single layer SNN	95.00
Tavanaei et al. 2016[30]	CSNN+SVM	98.36
Kheradpisheh et al. 2018[18]	CSNN+SVM	98.40
Diehl et al. 2015[12]	Converted CSNN	99.10
This work	CSNN+SVM	98.60

Table 8. TABLE VIII: Architecture used on Faces/Motorbikes.

Type	Filter size	Filter number	Stride	Padding
Convolution	$5 \times 5$	32	1	2
Pooling	$7 \times 7$	32	6	3
Convolution	$17 \times 17$	64	1	8
Pooling	$5 \times 5$	64	5	2
Convolution	$5 \times 5$	128	1	2

Equations

\mathrm{DoG}(x,y)=I(x,y)\ast\left(G_{\mathrm{DoG}_{\sf size},\mathrm{DoG}_{\sf center}}-G_{\mathrm{DoG}_{\sf size},\mathrm{DoG}_{\sf surround}}\right)

G_{K,\sigma}(u,v)=\frac{g_{\sigma}(u,v)}{\sum_{i=-\mu}^{\mu}\sum_{j=-\mu}^{\mu}g_{\sigma}(i,j)},\quad u,v\in[-\mu,\mu],\quad\mu=\frac{K}{2}

x_{\sf on}

x_{\sf off}

f:[0,1]

x

t=T_{\sf start}+(1-x)\times(T_{\sf end}-T_{\sf start})

\frac{\partial V}{\partial t}=\sum_{i\in S}V_{i}\,\delta(t-t_{i}),\quad V\leftarrow 0\text{ when }V\geq V_{\sf th}

\Delta_{W}=\left\{\begin{array}{rl}\eta_{W}&\text{if }t_{\sf pre}\leq t_{\sf post}\\ -\eta_{W}&\text{o.w.}\end{array}\right.

\Delta_{W}=\left\{\begin{array}{rl}\eta_{W}e^{-\beta\frac{W-W_{\sf min}}{W_{\sf max}-W_{\sf min}}}&\text{if }t_{\sf pre}\leq t_{\sf post}\\ -\eta_{W}e^{-\beta\frac{W_{\sf max}-W}{W_{\sf max}-W_{\sf min}}}&\text{o.w.}\end{array}\right.

\Delta_{W}=\left\{\begin{array}{rl}\eta_{W}e^{-\frac{t_{\sf pre}-t_{\sf post}}{\tau}}&\text{if }t_{\sf pre}\leq t_{\sf post}\\ -\eta_{W}e^{-\frac{t_{\sf post}-t_{\sf pre}}{\tau}}&\text{o.w.}\end{array}\right.

V_{\sf th}=\max\left(\mathrm{Th}_{\sf min},V_{\sf th}-\eta_{th}\left(t-t_{\sf target}\right)\right)

\Delta V_{{\sf th}_{i}}

V_{{\sf th}_{i}}

x=\min\left(1,\max\left(0,1-\frac{t-t_{\sf target}}{T_{\sf end}-t_{\sf target}}\right)\right)

y_{i}=\sum_{x=0}^{w}\sum_{y=0}^{h}v_{xyi}

sp(y)=\frac{\sqrt{n_{y}}-\frac{\sum_{i}^{n_{y}}|y_{i}|}{\sqrt{\sum_{i}^{n_{y}}y_{i}^{2}}}}{\sqrt{n_{y}}-1}

Taxonomy

Topics: Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices · Neural dynamics and brain function

Methods: Support Vector Machine

Full text

Multi-layered Spiking Neural Network with Target Timestamp Threshold Adaptation and STDP

This work has been partly funded by IRCICA (Univ. Lille, CNRS, USR 3380 – IRCICA, F-59000 Lille, France) as the Bioinspired Project.

Pierre Falez$^{1}$, Pierre Tirilly$^{1,2}$, Ioan Marius Bilasco$^{1}$, Philippe Devienne$^{1}$, and Pierre Boulet$^{1}$

$^{1}$ Univ. Lille, CNRS, Centrale Lille, UMR 9189 – CRIStAL – Centre de Recherche en Informatique, Signal et Automatique de Lille, F-59000, Lille, France

$^{2}$ IMT Lille Douai, F-59000, Lille, France

Email: [email protected]

Index Terms:

Convolutional neural networks, Neural network hardware, Pattern recognition, Unsupervised learning

I Introduction

Computer vision has rapidly evolved in recent years, in particular thanks to deep learning methods [19]. They show that using deep hierarchical representations improves the expressiveness of models [22], and yields state-of-the-art performance on many tasks [15] [27]. However, the question of the energy consumption of such models remains less frequently addressed, even though it has been raised by some authors [1] [7] [17].

Although efforts are being made to produce more energy-efficient architectures for traditional methods [29], producing ultra-low-power architectures seems to require using different classes of models. Spiking neural networks (SNNs) are good candidates to create energy-efficient hardware [21] [28]. To achieve this goal, SNNs use mechanisms closer to biology, notably the fact that computations and memory are exclusively local [25]. Instead of using numerical representations like traditional methods, SNNs use spikes to transmit information, which radically changes their learning mechanisms. However, their classification performances currently remain behind traditional deep learning methods [14]. This gap is in part due to the constraint of local computation in SNNs, which prevents the use of traditional training methods like back-propagation. New learning mechanisms are necessary to bypass this limit and allow them to compete with state-of-the-art methods.

The performance of SNNs is highly sensitive to network meta-parameters. As a consequence, an exhaustive search over the many parameters is generally required, which makes SNNs difficult to use. Reducing the impact of parameter values by using auto-adaptive parameters, or at least reducing the number of parameters, seems to be a key step towards making SNNs viable.

More specifically, neuron thresholds are among the key parameters of an SNN. They determine the number of input spikes a neuron requires to trigger a spike, which directly impacts the patterns that neurons can recognize and, therefore, the performance of the network. The optimal threshold value can vary widely across the neurons of a network because the inputs or internal patterns are made up of different numbers of spikes.

In this paper, we propose a new threshold learning rule, which is based on a target timestamp $t_{\sf target}$ at which a neuron must fire. This target timestamp directly controls the patterns that neurons can learn. By providing an adaptive threshold, this mechanism reduces the impact of its initial value. Moreover, thanks to the usage of a unique parameter, the search space to optimize is relatively small. Additionally, we provide a protocol to train multi-layered networks. We evaluate this mechanism with multi-layered SNNs on the Faces/Motorbikes [18] and MNIST [20] datasets. We study the impact of our threshold adaptation system, but also of the inhibition policy and of the STDP rule. Finally, we show that we can combine multiple networks trained with different $t_{\sf target}$ to improve the classification rate thanks to the different patterns learned by the network. This method reaches state-of-the-art results on both the Faces/Motorbikes and MNIST datasets.

II Related work

Early work in image recognition with fully unsupervised SNNs used single-layered networks [26] [11]. However, such methods yield low classification rates on the MNIST dataset (93.5% [26], 95% [11]) compared to traditional methods [20]. One of the first multi-layered STDP networks is [30]. In this work, a dedicated network, SAILNet, learns convolution filters from patches extracted from input samples. A pooling layer and a fully connected layer using probabilistic LIF neurons are stacked. A support vector machine (SVM) classifies the output of the last layer. This model reaches 98.36% on MNIST with 32 convolution filters and 128 output neurons. However, the usage of an external network to train convolutions remains an issue. Moreover, probabilistic LIF neurons are used in the feature discovery layer, which requires some global computation (softmax) to operate.

In [18], two convolution layers trained by STDP are used. The network reaches 98.4% on the MNIST dataset with 30 filters in the first convolution layer and 100 in the second one. However, this model uses some global computations: the potentials of neurons are compared to each other to designate the winner at every step and the filters are learned across the convolution columns. This model requires careful tuning of its parameters, especially the neuron thresholds. Moreover, the values of neuron thresholds must be manually changed between the training and testing stages. Finally, the output neurons use infinite thresholds, which would not be realistic on hardware.

Other authors focus on converting traditional deep neural networks, trained with back-propagation, into multi-layered SNNs [12] [8]. However, this method limits the interest of SNNs, since only the inference stage can be energy-efficient. Moreover, such networks are not able to adapt themselves continuously, since their parameters are fixed after the conversion. Other work adapts the back-propagation method to SNNs [5] [24] [2]. However, these models cannot be as energy-efficient since they need global computations to perform back-propagation.

III Background

In contrast to traditional artificial neural networks (ANNs), which use numerical values to represent information, SNNs use electrical impulses, called spikes. In this paper, for simplicity, spikes are represented by a Dirac impulse, defined by a timestamp $t$ and a voltage $V$ (Figure 1).

III-A Pre-processing

Before input samples are converted into spikes to be fed to the network, some pre-processing steps are applied. We use a difference-of-Gaussians (DoG) filter to simulate on-off cells [10]. Without this pre-processing, SNNs fail to learn useful patterns, leading to low classification performance [14].

DoG filters are applied with the same process as the one described in [18]:

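$\mathrm{DoG}(x,y)=I(x,y)\ast\left(G_{\mathrm{DoG}_{\sf size},\mathrm{DoG}_{\sf center}}-G_{\mathrm{DoG}_{\sf size},\mathrm{DoG}_{\sf surround}}\right)$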

where $I$ is the input image, $\ast$ is the convolution operator and $G_{K,\sigma}$ is a normalized Gaussian kernel of size $K$ and scale $\sigma$ defined as:

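$G_{K,\sigma}(u,v)=\frac{g_{\sigma}(u,v)}{\sum_{i=-\mu}^{\mu}\sum_{j=-\mu}^{\mu}g_{\sigma}(i,j)},\quad u,v\in[-\mu,\mu],\quad\mu=\frac{K}{2}$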

with $g_{\sigma}$ the centered 2D Gaussian function of variance $\sigma$ . The parameters of the filter are its size $\mathrm{DoG}_{\sf size}$ and the variances of the Gaussian kernels $\mathrm{DoG}_{\sf center}$ and $\mathrm{DoG}_{\sf surround}$ .

After applying DoG filtering over an input image, the resulting values are separated into two channels:

[TABLE]
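As an illustration of the pre-processing, the kernel construction can be sketched in a few lines of Python. Several choices below are assumptions made for the sketch and are not stated in the text: the half-width $\mu$ is taken as the integer part of $K/2$, $\sigma$ is treated as a variance (as worded above), and the on/off separation is assumed to be a simple rectification of the positive and negative parts of the DoG response. The function names are hypothetical.

```python
import math

def gaussian_kernel(size, variance):
    """Normalized Gaussian kernel G_{K,sigma} on the integer grid [-mu, mu]^2."""
    mu = size // 2  # assumption: integer half-width, so size=7 gives mu=3
    raw = [[math.exp(-(u * u + v * v) / (2.0 * variance))
            for v in range(-mu, mu + 1)]
           for u in range(-mu, mu + 1)]
    total = sum(sum(row) for row in raw)
    # Dividing by the total makes the kernel coefficients sum to 1.
    return [[value / total for value in row] for row in raw]

def dog_kernel(size=7, center=1.0, surround=4.0):
    """Difference of the center and surround kernels (defaults from Table I)."""
    g_center = gaussian_kernel(size, center)
    g_surround = gaussian_kernel(size, surround)
    return [[c - s for c, s in zip(row_c, row_s)]
            for row_c, row_s in zip(g_center, g_surround)]

def split_on_off(value):
    """Assumed rectification: positive part to the on channel, negative part to the off channel."""
    return max(0.0, value), max(0.0, -value)

kernel = dog_kernel()
```

Because convolution is linear, convolving the image with `kernel` is equivalent to convolving it with each Gaussian and subtracting the results. Since both Gaussians are normalized, the coefficients of `kernel` sum to zero, so uniform image regions produce no response on either channel.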

III-B Neural Coding

Since SNNs use spikes to transfer information within the network, it is necessary to define a function to encode the numerical values of input samples into spike trains and a function to decode spike trains at the output of the network. The encoding function is referred to as the neural coding. Mathematically, a neural coding can be described as follows:

[TABLE]

with $x$ the input pixel value and $\left(t_{0},t_{1},\cdots,t_{N_{x}}\right)$ the timestamps of the generated spikes.

Neural coding is subject to debate in the SNN community [6]. Two main coding techniques exist: frequency coding and temporal coding. While frequency coding uses spike frequencies to encode values, temporal coding uses the timestamps of spikes. One of the most used methods is latency coding [31], in which early spikes encode the largest values, while late spikes encode the lowest values:

$t=T_{\sf start}+(1-x)\times(T_{\sf end}-T_{\sf start})$

with $[T_{\sf start},T_{\sf end}]$ the time range of the sample, $x\in[0,1]$ the input value, and $t$ the timestamp of the generated spike.

This paper uses latency coding as neural coding as it has the main advantage of using few spikes (at most one spike per connection) to represent values, which makes the model easier to control. However, in latency coding, the timestamps at which neurons discharge are critical since they have a direct impact on the represented values.
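A minimal sketch of latency coding, assuming the default time range $[T_{\sf start},T_{\sf end}]=[0,1]$ of Table I, may help fix ideas. The function name `latency_encode` is hypothetical.

```python
def latency_encode(x, t_start=0.0, t_end=1.0):
    """Map an input value x in [0, 1] to a single spike timestamp."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    # Large values fire early, small values fire late.
    return t_start + (1.0 - x) * (t_end - t_start)

# Brighter pixels spike first.
timestamps = [latency_encode(x) for x in (1.0, 0.25, 0.0)]
```

With the default range, the values 1.0, 0.25 and 0.0 are mapped to the timestamps 0.0, 0.75 and 1.0 respectively.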

III-C Neuron Model

In this paper, we use integrate-and-fire (IF) neurons, which are one of the simplest spiking neuron models. This model integrates input spikes to its membrane potential $V$ . If $V$ exceeds a defined threshold $V_{\sf th}$ , then an output spike is triggered and $V$ is reset to 0. The model is defined by the following formula:

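$\frac{\partial V}{\partial t}=\sum_{i\in S}V_{i}\,\delta(t-t_{i}),\quad V\leftarrow 0\text{ when }V\geq V_{\sf th}$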

with $S$ the set of incoming spikes, $V_{i}$ the voltage of the $i$ th spike, $t_{i}$ the timestamp of the $i$ th spike and $\delta$ the Dirac function. In addition, all potentials are reset to zero between each sample. If a neuron fires a spike during the presentation of a sample, it enters its refractory mode until the end of the sample. This constraint forces neurons to fire at most once per sample, in order to comply with latency coding.
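The behavior of an IF neuron over one sample can be sketched as follows. This is an event-driven simplification written for illustration, not the simulator used in the experiments: spikes are given as (timestamp, voltage) pairs, and returning at the first threshold crossing stands in for the reset and the refractory period, since the neuron may fire at most once per sample. The function name is hypothetical.

```python
import math

def if_neuron_fire_time(spikes, v_th):
    """Return the firing timestamp of an IF neuron for one sample.

    spikes: iterable of (timestamp, voltage) pairs, in any order.
    Returns math.inf when the threshold is never reached.
    """
    v = 0.0  # the membrane potential starts at zero for each sample
    for t, voltage in sorted(spikes):
        v += voltage  # each Dirac spike adds its voltage instantaneously
        if v >= v_th:
            return t  # fire once, then stay silent until the end of the sample
    return math.inf

t_fire = if_neuron_fire_time([(0.3, 1.0), (0.1, 1.0), (0.2, 1.0)], v_th=2.0)
```

In this example the threshold is reached by the second spike in time order, so the neuron fires at timestamp 0.2.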

III-D Synapse Model

Synapses modulate the spike voltage $V$ that passes through connections according to their synaptic weights $W$ : $V_{O}=WV_{I}$ , with $V_{I}$ the voltage of the spike at the input of the synapse and $V_{O}$ its voltage at the output. This weight can be constant or can be trained following a learning rule. In our synapse model, $W$ is clipped in the range [ $W_{\sf min}$ , $W_{\sf max}$ ]. One of the most used learning rules is spike-timing-dependent plasticity (STDP) [3], which updates the weights according to the difference between the firing timestamps of the input neuron and the output neuron. One of the simplest forms of the STDP rule is additive STDP [4]. Its principle is to increase connection weights where input neurons fire spikes before output neurons (long-term potentiation) and to decrease the others (long-term depression). Mathematically, additive STDP can be written as:

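$\Delta_{W}=\left\{\begin{array}{rl}\eta_{W}&\text{if }t_{\sf pre}\leq t_{\sf post}\\ -\eta_{W}&\text{o.w.}\end{array}\right.$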

with $\Delta_{W}$ the weight variation, $\eta_{W}$ the learning rate, $t_{\sf pre}$ the firing timestamp of the input neuron ( $+\infty$ if no spike occurs) and $t_{\sf post}$ the firing timestamp of the output neuron.

Other forms of STDP exist in the literature. Multiplicative STDP [26] allows reducing the effect of weight saturation by using updates that depend on the current value of $W$ . This STDP rule is defined by the following formula:

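$\Delta_{W}=\left\{\begin{array}{rl}\eta_{W}e^{-\beta\frac{W-W_{\sf min}}{W_{\sf max}-W_{\sf min}}}&\text{if }t_{\sf pre}\leq t_{\sf post}\\ -\eta_{W}e^{-\beta\frac{W_{\sf max}-W}{W_{\sf max}-W_{\sf min}}}&\text{o.w.}\end{array}\right.$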

with $\beta$ the parameter which controls the saturation effect (increasing $\beta$ reduces the saturation).

Finally, biological STDP [3] adds non-linearity by including a leak according to the delay between $t_{\sf pre}$ and $t_{\sf post}$:

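$\Delta_{W}=\left\{\begin{array}{rl}\eta_{W}e^{-\frac{t_{\sf pre}-t_{\sf post}}{\tau}}&\text{if }t_{\sf pre}\leq t_{\sf post}\\ -\eta_{W}e^{-\frac{t_{\sf post}-t_{\sf pre}}{\tau}}&\text{o.w.}\end{array}\right.$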

with $\tau$ the time constant that controls the leak.
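The three rules can be compared side by side in a short sketch. Defaults follow Table I, and the function names are hypothetical. One point is an assumption of this sketch rather than a statement of the paper: for biological STDP, the update is written as a decaying exponential of the absolute delay $|t_{\sf post}-t_{\sf pre}|$, which is the usual reading of a leak and keeps the update bounded by $\eta_{W}$.

```python
import math

def clip(w, w_min=0.0, w_max=1.0):
    """Keep a weight inside [w_min, w_max]."""
    return min(w_max, max(w_min, w))

def additive_stdp(t_pre, t_post, eta=0.1):
    """Constant potentiation or depression, independent of the weight."""
    return eta if t_pre <= t_post else -eta

def multiplicative_stdp(w, t_pre, t_post, eta=0.1, beta=1.0,
                        w_min=0.0, w_max=1.0):
    """Updates shrink as the weight approaches the bound it moves towards."""
    span = w_max - w_min
    if t_pre <= t_post:
        return eta * math.exp(-beta * (w - w_min) / span)
    return -eta * math.exp(-beta * (w_max - w) / span)

def biological_stdp(t_pre, t_post, eta=0.1, tau=0.1):
    """Assumed form: the update decays with the absolute pre/post delay."""
    decay = math.exp(-abs(t_post - t_pre) / tau)
    return eta * decay if t_pre <= t_post else -eta * decay

# Example: an input spike at 0.2 followed by an output spike at 0.5.
w = 0.9
w = clip(w + multiplicative_stdp(w, t_pre=0.2, t_post=0.5))
```

A silent input neuron is represented by `t_pre = math.inf`, which always falls in the depression branch, in line with the convention given above for $t_{\sf pre}$.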

III-E Network Architecture

The network is composed of stacked feed-forward layers. For a layer $L(n)$, there are $L_{d}(n)$ feature maps, each of them containing $L_{w}(n)\times L_{h}(n)$ neurons. Three types of layers are used in this paper: convolution, pooling, and fully-connected layers. The shape of a layer depends on its filter size $F_{w}(n)\times F_{h}(n)$, its padding $P(n)$, and its stride $S(n)$. Each neuron of layer $L(n)$ is connected to $F_{w}(n-1)\times F_{h}(n-1)\times L_{d}(n-1)$ neurons of the previous layer, which form the receptive field of the neuron. In the pooling layers, all the parameters are constant: neuron thresholds and synaptic weights are fixed to 1. When a spike is triggered in its receptive field, a pooling neuron directly fires a spike. This mimics a max-pooling operation. A column $C_{x,y}(n)$ designates the $L_{d}(n)$ neurons present at position $(x,y)$ in the $L_{d}(n)$ feature maps of $L(n)$.
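The dependence of the layer shape on filter size, padding, and stride can be checked with the usual convolution arithmetic. The sketch below assumes the standard formula $(L+2P-F)/S+1$, a $28\times 28$ input (the size of MNIST images), and zero padding for the architecture of Table II, which lists no padding column. Function and variable names are hypothetical.

```python
def output_size(input_size, filter_size, stride=1, padding=0):
    """Width (or height) of a layer given the previous layer width (or height)."""
    return (input_size + 2 * padding - filter_size) // stride + 1

# (type, filter size, stride) for each row of Table II; padding assumed to be 0.
mnist_layers = [
    ("convolution", 5, 1),
    ("pooling", 2, 2),
    ("convolution", 5, 1),
    ("pooling", 2, 2),
    ("fully-connected", 4, 1),
]

def trace_sizes(input_size, layers):
    """Propagate the spatial size through a stack of layers."""
    sizes = []
    size = input_size
    for _, filter_size, stride in layers:
        size = output_size(size, filter_size, stride)
        sizes.append(size)
    return sizes

sizes = trace_sizes(28, mnist_layers)  # [24, 12, 8, 4, 1]
```

Under these assumptions the second pooling layer produces $4\times 4$ maps, which matches the $4\times 4$ filter of the fully-connected layer: the last layer then has a single column.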

IV Contribution

The introduction of new mechanisms to help neurons fire spikes at optimal timestamps is of paramount importance. We introduce in this paper a novel threshold adaptation mechanism that trains neurons to discharge at a defined timestamp.

IV-A Time target threshold adaptation

Thresholds have a major role in the behavior of spiking neurons [9]. First, threshold values directly impact the patterns recognized by neurons. Large threshold values allow recognizing patterns composed of large numbers of spikes (Figure 2). Neurons with smaller threshold values use only the first input spikes in the pattern recognition process. Since latency coding is used, the first spikes encode the largest values of the input sample, which means that the neuron focuses on the most salient parts of the input, learning very local patterns, like edges. On the contrary, large threshold values allow the neuron to integrate more spikes before firing, including late spikes, which encode smaller values. Neurons with large thresholds can recognize larger patterns, like surfaces. However, the optimal threshold value is unknown and is highly dependent on the data. Finally, threshold adaptation maintains the homeostasis of the system: it ensures that no neuron takes advantage over the others.

A common method to adapt thresholds in SNNs is to use leaky adaptive thresholds [11]: when a neuron fires a spike, its threshold is increased to prevent it from firing too often. An exponential leak is applied to help neurons with weak activities. However, this mechanism uses two parameters, which makes the search for suitable values difficult [13]. Moreover, those parameters do not enhance the convergence towards the different types of patterns shown in Figure 2.

This paper introduces a new method to adjust neuron thresholds. The idea is to define an objective timestamp $t_{\sf target}$, and to train neurons to fire at this timestamp. To do so, we define a threshold adaptation rule, as follows:

$V_{\sf th}=\max\left(\mathrm{Th}_{\sf min},V_{\sf th}-\eta_{th}\left(t-t_{\sf target}\right)\right)$

with $V_{\sf th}$ the neuron threshold, $t$ the timestamp at which the neuron fires, $\eta_{th}$ the threshold learning rate and $\mathrm{Th}_{\sf min}$ the minimal threshold allowed. This rule corrects the timing error between the actual firing timestamp $t$ and the objective timestamp $t_{\sf target}$ at each neuron discharge. The optimal value for $t_{\sf target}$ depends on the dataset; it requires an exhaustive search in the range $[T_{\sf start},T_{\sf end}]$ .

This rule assumes that the input spikes that trigger an output spike are not simultaneous, which is the case in practice with image data. With data that does not satisfy this assumption, synaptic delays would have to be adapted.
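To illustrate how the rule moves the firing timestamp towards $t_{\sf target}$, the sketch below applies it repeatedly to a single IF neuron. The input is a toy example invented for the illustration (ten unit spikes spread evenly over $[0,1]$), the learning rate is kept constant, and the helper names are hypothetical.

```python
import math

def fire_time(spikes, v_th):
    """Firing timestamp of an IF neuron for one sample (math.inf if silent)."""
    v = 0.0
    for t, voltage in sorted(spikes):
        v += voltage
        if v >= v_th:
            return t
    return math.inf

def adapt_threshold(v_th, t_fire, t_target, eta_th=1.0, th_min=1.0):
    """Late firing lowers the threshold, early firing raises it."""
    return max(th_min, v_th - eta_th * (t_fire - t_target))

# Toy input (hypothetical): unit spikes at 0.05, 0.15, ..., 0.95.
spikes = [((2 * k + 1) / 20, 1.0) for k in range(10)]

v_th = 5.0
history = []
for _ in range(30):  # present the same sample 30 times
    t = fire_time(spikes, v_th)
    history.append(t)
    if t != math.inf:  # the rule is applied at each neuron discharge
        v_th = adapt_threshold(v_th, t, t_target=0.75)
```

The neuron first fires too early (at 0.45, on the fifth spike), so its threshold is raised step by step until it fires on the eighth spike, at the target timestamp 0.75. The timing error is then zero and the threshold stops changing.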

IV-B Competition System

Using local and unsupervised learning requires competition mechanisms in order to ensure that neurons learn distinct patterns [26]. Winner-take-all (WTA) inhibition is a straightforward method to do so: only the winning neuron (i.e. the first neuron to spike, since latency coding is used) will apply the learning rule during a pattern and, so, will be able to recognize it. However, the risk of the WTA strategy is that one neuron can take the advantage over the others, and win on every sample. To guarantee the homeostasis of the system, a second update is applied to $V_{\sf th}$ : the threshold of the winning neuron is increased, while the thresholds of inhibited neurons are decreased, following the formula:

[TABLE]

with $N$ the number of neurons in competition, and $t_{i}$ the firing timestamp of neuron $i$ .

WTA inhibition is used during training: only one neuron is allowed to fire among the $N$ neurons on each sample. This mechanism is required to guarantee that neurons will learn different patterns, since only one neuron will apply STDP per sample. However, WTA inhibition drastically reduces the spiking activity, which can lead to poor classification performance [14]. For this reason, we remove the inhibition mechanism during the inference stage. An intermediate inhibition policy, named soft inhibition, is also investigated in this paper. This policy uses inhibition spikes, which reduce the membrane voltage $V$ of the other neurons by a $V_{\mathrm{inh}}$ constant, but does not prevent them from firing spikes.
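The three inhibition policies can be contrasted on a toy column of IF neurons. The sketch below is a simplified event-driven simulation written for illustration: it omits the learning rule and the homeostatic threshold update, breaks ties between simultaneous winners by lowest index (an assumption), and uses hypothetical names and toy data.

```python
import math

def run_column(spikes, weights, thresholds, policy="none", v_inh=1.0):
    """Firing timestamp of each neuron of one column for one sample.

    spikes: list of (timestamp, input_index) pairs.
    weights[j][i]: synaptic weight from input i to neuron j.
    policy: "none", "soft" or "wta".
    """
    n = len(thresholds)
    v = [0.0] * n
    fired = [math.inf] * n
    for t, i in sorted(spikes):
        # Integrate the incoming spike in every neuron that has not fired yet.
        winners = []
        for j in range(n):
            if fired[j] == math.inf:
                v[j] += weights[j][i]
                if v[j] >= thresholds[j]:
                    winners.append(j)
        if policy == "wta" and winners:
            # Only the first neuron to spike fires; all others stay silent.
            fired[winners[0]] = t
            break
        for j in winners:
            fired[j] = t
        if policy == "soft":
            # Each firing neuron lowers the potential of the silent ones.
            for j in winners:
                for k in range(n):
                    if k != j and fired[k] == math.inf:
                        v[k] -= v_inh
    return fired

# Toy data: two neurons with identical weights but different thresholds.
weights = [[1.0] * 4, [1.0] * 4]
thresholds = [2.0, 2.5]
spikes = [(0.1, 0), (0.2, 1), (0.3, 2), (0.4, 3)]
results = {p: run_column(spikes, weights, thresholds, p)
           for p in ("none", "soft", "wta")}
```

On this example the second neuron fires at 0.3 without inhibition, is delayed to 0.4 by soft inhibition, and never fires under WTA, which illustrates how stronger inhibition reduces the spiking activity.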

IV-C Network Output

The last step is to interpret the output of the network. Since latency coding is used, the earliest output spikes will encode the highest values. Output values are computed according to the expected $t_{\sf target}$ set in the output layer, as follows:

$x=\min\left(1,\max\left(0,1-\frac{t-t_{\sf target}}{T_{\sf end}-t_{\sf target}}\right)\right)$

with $t$ the spike timestamp (set to $+\infty$ if no spike occurs).
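A small sketch of this decoding step, assuming $T_{\sf end}=1$ and $t_{\sf target}<T_{\sf end}$, is given below. The function name is hypothetical.

```python
def decode_output(t, t_target, t_end=1.0):
    """Convert an output spike timestamp into a value in [0, 1]."""
    # Spikes at or before t_target map to 1, spikes at t_end or later map to 0.
    return min(1.0, max(0.0, 1.0 - (t - t_target) / (t_end - t_target)))

values = [decode_output(t, t_target=0.75)
          for t in (0.5, 0.75, 0.875, 1.0, float("inf"))]
```

With $t_{\sf target}=0.75$, the timestamps 0.5, 0.75, 0.875, 1.0 and $+\infty$ are decoded as 1.0, 1.0, 0.5, 0.0 and 0.0: a neuron that stays silent contributes a zero feature.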

IV-D Training

Traditionally, convolution layers require non-local operations and non-local memory since they use shared weights: columns need to communicate with each other to share the same filters. We use a specific training protocol in order to reduce the cost of the global communication needed by the convolutions. One layer is trained at a time, from the layer closest to the input to the one at the output of the network. During the training of a convolution layer, only one column is activated to avoid the usage of inter-column communications. Once the layer is trained, its parameters (weights and thresholds) are fixed and are copied onto the other columns of the layer. This operation is necessary since pooling layers require the same filters in adjacent columns. In order to keep the position invariance brought by shared weights, random patches of size $F_{w}(n)\times F_{h}(n)$ are extracted from inputs of the layer. Unlike in [18], neurons do not react only to the most salient part of each image.

V Results

V-A Experimental Protocol

For each trained layer, the training set is processed $N_{\sf epoch}$ times. A simulated annealing procedure is applied after every epoch: the learning rates ($\eta_{W}$ and $\eta_{Th}$) are decreased by a factor $\lambda$. This helps the network converge to a stable state during training. Once the training is finished, the training set and the test set are processed by the network, which converts all the samples into their output representation. If the output layer has multiple columns (i.e. $L_{w}(n)>1$ or $L_{h}(n)>1$), sum pooling is applied over the positions of the feature maps to produce a feature vector $y=(y_{1},...,y_{x})$:

$y_{i}=\sum_{x=0}^{w}\sum_{y=0}^{h}v_{xyi}$

with $v_{xyi}$ the value of output of the network at position $(x,y)$ in feature map $i$ .
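Sum pooling over the positions of the feature maps can be sketched as follows, with the network output stored as a nested list indexed as `v[x][y][i]`. This layout and the function name are assumptions of the sketch.

```python
def sum_pool(v):
    """Sum the output values of each feature map over all positions (x, y)."""
    width, height, depth = len(v), len(v[0]), len(v[0][0])
    return [sum(v[x][y][i] for x in range(width) for y in range(height))
            for i in range(depth)]

# Toy output with 2 x 2 positions and 2 feature maps.
v = [[[1.0, 0.0], [0.5, 0.0]],
     [[0.0, 0.25], [0.5, 0.25]]]
y = sum_pool(v)  # [2.0, 0.5]
```

The resulting vector has one component per feature map, whatever the number of columns in the output layer.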

If the output layer has only one column, it directly outputs vector $y$ . An SVM with a linear kernel is trained over the output training set. SVM parameters are not optimized (we set $C=1$ ). Figure 3 shows the complete network topology. Besides classification rates, we investigate the sparsity of the network. The sparsity is computed over the output vectors $y$ of the test set with the following formula, used in [16]:

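$sp(y)=\frac{\sqrt{n_{y}}-\frac{\sum_{i}^{n_{y}}|y_{i}|}{\sqrt{\sum_{i}^{n_{y}}y_{i}^{2}}}}{\sqrt{n_{y}}-1}$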

with $n_{y}$ the vector dimension. This measure produces values in [0, 1]. Values close to 1 mean that the vector is sparse (i.e. most of the features are close to 0).
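The sparsity measure can be sketched as below. The sketch assumes the square-root form of the measure, i.e. the ratio of the L1 norm to the L2 norm compared with $\sqrt{n_{y}}$, which is the form that yields values in [0, 1]. Returning 0 for an all-zero vector is a convention chosen for this sketch, since the ratio is undefined in that case. The function name is hypothetical.

```python
import math

def sparsity(y):
    """Sparsity of a feature vector: 1 for a one-hot vector, 0 for a uniform one."""
    n = len(y)
    if n < 2:
        raise ValueError("the measure needs at least two components")
    l1 = sum(abs(value) for value in y)
    l2 = math.sqrt(sum(value * value for value in y))
    if l2 == 0.0:
        return 0.0  # convention of this sketch for an all-zero vector
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)

one_hot = sparsity([0.0, 0.0, 1.0, 0.0])   # 1.0
uniform = sparsity([1.0, 1.0, 1.0, 1.0])   # 0.0
```

A vector with a single active feature reaches the maximal value of 1, while a vector whose features are all equal gives 0.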

All the results reported in this paper are averaged over 10 runs. The default parameters are reported in Table I.
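The annealing schedule of the protocol can be written out explicitly. The sketch assumes that "decreased by a factor $\lambda$" means a multiplication by $\lambda$ after every epoch, so that the rate used during epoch $k$ (counted from 0) is $\eta(0)\lambda^{k}$. The function name is hypothetical, and the defaults are those of Table I.

```python
def annealed_rates(eta0, decay=0.95, n_epoch=100):
    """Learning rate used during each epoch, decayed after every epoch."""
    rates = []
    eta = eta0
    for _ in range(n_epoch):
        rates.append(eta)
        eta *= decay  # annealing step applied once the epoch is finished
    return rates

stdp_rates = annealed_rates(0.1)       # eta_W(0) = 0.1
threshold_rates = annealed_rates(1.0)  # eta_Th(0) = 1.0
```

Under this reading, the rates of the last epoch are more than two orders of magnitude smaller than the initial ones, which freezes weights and thresholds progressively.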

V-B MNIST

MNIST is a handwritten digit dataset [20]. The training set contains 60,000 samples and the test set contains 10,000 samples. The network architecture is detailed in Table II.

V-B1 Threshold Target Time

First, we study the impact of the parameter $t_{\sf target}$. It directly impacts both the learned filters (Figure 4) and the classification performance (Figure 5). While low values of $t_{\sf target}$ lead to very local patterns (Figure 4a), larger values lead to more global patterns (Figure 4c). Using a late $t_{\sf target}$, and so training neurons to integrate a large number of spikes, helps to improve the classification rate. However, the performance decreases with very late $t_{\sf target}$: the latest spikes, which encode the lowest input values, are not useful for pattern classification. Networks with $t_{\sf target}=0.75$ yield state-of-the-art results for SNNs trained with STDP on the MNIST dataset: 98.47% (see Table VII for competing approaches).

The two update mechanisms described in Equation 8 and Equation 9 are necessary to reach good classification rates. When Equation 9 is disabled in the threshold update, the homeostasis of the system is not maintained, which leads to a classification rate of $94.54\pm 1.16$% when $t_{\sf target}=0.75$. When Equation 8 is disabled, controlling the type of pattern to be learned becomes difficult and highly dependent on the initial values of the thresholds $V_{\sf Th}(0)$.

Using markedly different $t_{\sf target}$ values across the layers decreases the performance (Table III). Let $\Delta_{t}$ be the difference between the $t_{\sf target}$ parameters of two consecutive layers. Since neurons of the previous layer are trained to fire at specific timestamps, setting an earlier $t_{\sf target}$ ($\Delta_{t}<0$) on the current layer results in missing spikes from the previous neurons. Setting a later $t_{\sf target}$ ($\Delta_{t}>0$) results in taking into account spikes that come too late after the $t_{\sf target}$ of the previous layer. A spike which arises too late compared to $t_{\sf target}$ means that the current pattern is not similar to those usually recognized by the input neuron.

With small values of $|\Delta_{t}|$, the performance of the network remains stable, which shows that the threshold adaptation mechanism is noise-resistant to some extent. However, large values of $|\Delta_{t}|$ have a negative impact on the classification rate, especially when $\Delta_{t}<0$. Sparsity decreases as $\Delta_{t}$ increases: positive values of $\Delta_{t}$ tend to let neurons integrate more spikes and, so, allow more neurons to fire. For $\Delta_{t}=-0.20$, the classification rate and sparsity are very low because the network cannot generate any spike: the $t_{\sf target}$ of the second layer is defined to be a timestamp at which no spikes have been generated yet by the first layer. $\Delta_{t}=+0.01$ yields the best result: 98.54%. This small offset seems to reinforce the resistance to noise, without integrating spikes generated by unrelated patterns. These results show that finding a single value for $t_{\sf target}$ is sufficient in the exhaustive search, and the other $t_{\sf target}$ can be defined by using a very small or null $\Delta_{t}$. This makes it easy to set the threshold adaptation of a multilayer SNN.

V-B2 Inhibition

We run experiments to measure the impact of the inhibition strategy on recognition rates, comparing the three inhibition policies detailed in Section IV-B. Table IV shows that increasing the hardness of inhibition during inference tends to decrease the recognition rate, which can be related to the sparsity level. The effect of inhibition is minimal in the first layer but is accentuated after each layer, so it strongly impacts both the sparsity and the recognition rate in the fully-connected layer. It is already visible with soft inhibition and is maximal with the WTA policy: the sparsity of the fully-connected layer is 1, while the recognition rate is only 63.43%. Maintaining higher levels of activity helps to learn better representations.

V-B3 STDP Rule

We study the effects of the STDP rules on the network classification rates and sparsity. We test the three STDP rules described in Section III-D: additive STDP, multiplicative STDP, and biological STDP (Table V). Additive STDP yields a baseline performance of 96.10% and a relatively high level of sparsity (0.8057). Figure 6a shows that this rule leads to binary weights (0 or 1) due to a saturation effect. Multiplicative STDP reduces this effect through the $\beta$ parameter: large values of $\beta$ drastically reduce the number of weights close to 0 or 1 (Figure 6b). Table V shows that increasing $\beta$ decreases the sparsity. $\beta=3.0$ provides a classification rate of 98.22% and a sparsity of 0.3215. Finally, the best performance (98.47%) is reached with biological STDP with $\tau=0.1$. Decreasing this parameter also reduces the sparsity. Figure 6c shows that filters learned by biological STDP look different from the ones learned by the other STDP rules. Indeed, additive and multiplicative STDP rules never learn patterns in which the on and off channels overlap (i.e. red and green pixels are always separated in the filters), because our input coding does not allow generating a spike from both channels at the same position. In contrast, biological STDP leads to filters with reinforced connections on the two channels (yellow pixels), which means that biological STDP is able to combine multiple patterns. For both multiplicative and biological STDP, the networks with the lowest levels of sparsity never yield the best classification performance.
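
The saturation effect of additive STDP and the damping introduced by $\beta$ can be illustrated with a minimal sketch. The update rules below are commonly used forms of the three rules and may differ in detail from the definitions of Section III-D; the names `a_plus`, `a_minus`, `W_MIN`, and `W_MAX`, as well as all numeric values other than $\beta=3.0$ and $\tau=0.1$, are assumptions made for the illustration.

```python
import math

# Assumed weight bounds (hypothetical values for this sketch).
W_MIN, W_MAX = 0.0, 1.0


def clip(w):
    """Keep a weight inside [W_MIN, W_MAX]."""
    return max(W_MIN, min(W_MAX, w))


def additive_stdp(w, t_pre, t_post, a_plus=0.1, a_minus=0.1):
    """Constant step: potentiate if the input spike precedes the output spike."""
    dw = a_plus if t_pre <= t_post else -a_minus
    return clip(w + dw)


def multiplicative_stdp(w, t_pre, t_post, beta=3.0, a_plus=0.1, a_minus=0.1):
    """Step shrinks exponentially as the weight approaches the bound it moves to."""
    span = W_MAX - W_MIN
    if t_pre <= t_post:
        dw = a_plus * math.exp(-beta * (w - W_MIN) / span)
    else:
        dw = -a_minus * math.exp(-beta * (W_MAX - w) / span)
    return clip(w + dw)


def biological_stdp(w, t_pre, t_post, tau=0.1, a_plus=0.1, a_minus=0.1):
    """Step decays exponentially with the delay between the two spikes."""
    delay = abs(t_post - t_pre)
    sign = a_plus if t_pre <= t_post else -a_minus
    return clip(w + sign * math.exp(-delay / tau))


# Apply 20 identical potentiation events to a weight starting at 0.5.
w_add = w_mul = 0.5
for _ in range(20):
    w_add = additive_stdp(w_add, t_pre=0.2, t_post=0.3)
    w_mul = multiplicative_stdp(w_mul, t_pre=0.2, t_post=0.3)
```

In this sketch `w_add` reaches the upper bound and stays there, while `w_mul` keeps moving in ever smaller steps and remains strictly inside the bounds, which is consistent with the binary weights of Figure 6a and the smoother distribution of Figure 6b. With `biological_stdp`, the size of the update depends on spike timing rather than on the current weight.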

The shapes of the filters in the fully-connected layer also differ from one STDP rule to another. While additive and multiplicative STDP lead to easily identifiable digits (Figure 7a), biological STDP provides less obvious filters (Figure 7b). The non-linearity brought by biological STDP seems to allow learning more complex features, which improves performance.

V-B4 Multiple Target Timestamp Networks

Finally, we use several groups of neurons with different $t_{\sf target}$ values. Representations learned with different target timestamps can contain more diverse patterns. We train independent networks in which all layers are set with a given $t_{\sf target}$ value. Then, we merge the features at the output of each network by concatenating them, and feed the resulting feature vector to the classifier. To make a fair comparison, we compare configurations that result in feature vectors of the same dimension (4096).
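
The merging step can be sketched as a per-sample concatenation of the outputs of the independent networks. The function name `merge_features` and the split into two networks of 2048 features each are hypothetical; only the final dimension of 4096 comes from the text.

```python
def merge_features(per_network_features):
    """Concatenate, sample by sample, the feature vectors of several networks.

    per_network_features[k][i] is the feature vector produced by network k
    for sample i. All networks must have processed the same samples.
    """
    if not per_network_features:
        raise ValueError("at least one network is required")
    n_samples = len(per_network_features[0])
    if any(len(feats) != n_samples for feats in per_network_features):
        raise ValueError("all networks must provide the same number of samples")
    return [
        [value for feats in per_network_features for value in feats[i]]
        for i in range(n_samples)
    ]


# Hypothetical example: two networks trained with different t_target values,
# each producing 2048 features for 3 samples (dummy constant values).
features_a = [[0.0] * 2048 for _ in range(3)]
features_b = [[1.0] * 2048 for _ in range(3)]
merged = merge_features([features_a, features_b])
```

Each row of `merged` is the vector that would be passed to the classifier. Keeping its length fixed across configurations is what makes the comparison between single-target and multi-target setups fair.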

Table VI shows that using multiple targets improves the classification performance. The network reaches a recognition rate of 98.60%, which is better than existing comparable methods (Table VII). One explanation may be that combining different $t_{\sf target}$ values allows the network to detect more varied patterns. Only methods that convert ANNs to SNNs after training [12] outperform our model.

V-C Faces/Motorbikes

Finally, we test our model on the Faces/Motorbikes dataset used in [18] in order to ensure that the model performs well with more realistic images. The dataset contains two classes extracted from the Caltech-101 dataset: faces and motorbikes. As in [18], images are resized to $250\times 160$ pixels, then converted to grayscale. The training set has 474 samples and the test set has 759 samples. Since our training protocol differs from [18] (Section IV-D), it is necessary to increase the number of filters in the convolution layer and to use larger values for $\mathrm{Th}_{\sf min}$ (in the following experiment, we choose 8) to focus on patterns resulting from enough spikes. We use additive STDP in all the convolution layers. The detailed architecture is provided in Table VIII.

Our model gives results similar to those reported in [18] (Figure 8), where the best reported result is 99.1%. When using $t_{\sf target}=0.8$, our model performs better, with an average of 99.46%. The learned filters are similar to those of [18] (Figure 9).

VI Discussion

Our model is almost fully local and is unsupervised from the input to the classifier. However, convolution layers remain an issue for hardware implementations. We succeeded in learning one convolution column independently from the others, but the weight and threshold values still have to be copied onto the other columns after training. This is needed to reconstruct the geometry of the feature maps, for instance to apply pooling. Moreover, we used a linear SVM for classification, whereas a fully hardware-implementable SNN requires a bio-inspired classifier. Recent work has succeeded in using supervised STDP as a classifier in multi-layered SNNs [23]. We aim at investigating the performance of our model with such learning rules, while respecting the constraint of local computations. Finally, results showed that $t_{\sf target}$ has a strong impact on the classification performance of the network. This parameter could be made self-adaptive, so that neurons can find by themselves the best timing for firing. Such mechanisms would have the advantage of setting an optimal $t_{\sf target}$ value for each feature independently.

VII Conclusion

Previous multi-layered SNN models require particular attention when setting neuron thresholds. An exhaustive search is needed to optimize them. Moreover, the optimal values vary from one layer to another [18]. We introduced a threshold adaptation mechanism which relies on a single parameter for all the layers and learns more varied patterns. Experiments showed that our model leads to state-of-the-art results with unsupervised SNNs on MNIST (98.60%) and on Faces/Motorbikes (99.46%). We also showed that removing the inhibition during the inference step helps to reduce the sparsity of the model, which improves performance. We also investigated the STDP rules and showed that biological STDP helps to improve the network performance by introducing non-linearity.

Bibliography

[1] Ron Bekkerman, Mikhail Bilenko, and John Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.
[2] Yoshua Bengio, Dong-Hyun Lee, Jorg Bornschein, Thomas Mesnard, and Zhouhan Lin. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156, 2015.
[3] Guo-qiang Bi and Mu-Ming Poo. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience, 18(24):10464–10472, 1998.
[4] Olivier Bichler, Damien Querlioz, Simon J. Thorpe, Jean-Philippe Bourgoin, and Christian Gamrat. Unsupervised features extraction from asynchronous silicon retina through spike-timing-dependent plasticity. In IJCNN, pages 859–866, 2011.
[5] Sander M. Bohte, Joost N. Kok, and Han La Poutre. Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing, 48(1):17–37, 2002.
[6] Romain Brette. Philosophy of the spike: rate-based vs. spike-based theories of the brain. Frontiers in Systems Neuroscience, 9, 2015.
[7] Yongqiang Cao, Yang Chen, and Deepak Khosla. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113(1):54–66, 2015.
[8] Yongqiang Cao, Yang Chen, and Deepak Khosla. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113(1):54–66, 2015.