Understanding the computational difficulty of a binary-weight perceptron and the advantage of input sparseness

Zedong Bi; Changsong Zhou

arXiv:1901.10856·cond-mat.dis-nn·July 13, 2020

TL;DR

This study investigates how low-precision binary weights and input sparseness affect the computational difficulty of training perceptrons, revealing that input sparseness simplifies the solution-finding process by reducing weight correlations.

Contribution

It demonstrates that input sparseness alleviates computational difficulty in binary perceptrons by decreasing cross-correlations among weights during decimation, providing insights into neural and artificial learning processes.

Findings

01

Input sparseness reduces the number of steps needed to fix late-decimation weights.

02

Decimation approaches the dense solution region in weight configuration space.

03

Computational difficulty stems from the subspace of late-decimation-fixed variables.

Abstract

Limited precision of synaptic weights is a key aspect of both biological and hardware implementations of neural networks. Assigning low-precision weights during learning is a non-trivial task, but may benefit from representing to-be-learned items using a sparse code. However, the computational difficulty resulting from low weight precision and the advantage of sparse coding are not yet fully understood. Here, we study a perceptron model, which associates binary (0 or 1) input patterns with outputs using binary (0 or 1) weights, modeling a single neuron receiving excitatory inputs. We considered a decimation process, where at every time step, marginal probabilities of unfixed weights were evaluated, then the most polarized weight was fixed at its preferred value. We showed that decimation is a process approaching the dense solution region in weight configuration space. In two efficient algorithms…

Equations

$$h_{i\to\mu}^{t+1}=f\left[p(t)\,h_{i}^{t}+\phi_{i\to\mu}\left(\bigcup_{\nu\in\partial i\setminus\mu}\hat{h}_{\nu\to i}^{t}\right)\right]+(1-f)\,h_{i\to\mu}^{t},$$

$$\hat{h}_{\mu\to i}^{t+1}=\hat{\phi}_{\mu\to i}\left(\bigcup_{j\in\partial\mu\setminus i}h_{j\to\mu}^{t}\right),$$

$$h_{i}^{t+1}=p(t)\,h_{i}^{t}+\phi_{i}\left(\bigcup_{\nu\in\partial i}\hat{h}_{\nu\to i}^{t}\right).$$

$$\Delta_{i,\text{BPD}}=\left[\left\langle\left(w_{i}^{x}-w_{i}^{\text{BPD}}\right)^{2}\right\rangle_{x}\right],$$

$$F_{\text{local}}(d,N_{\text{fix}})=\frac{1}{N}\left[\left\langle\Theta\left(\mathcal{N}(\mathbf{w},d)\right)\log\left(\mathcal{N}(\mathbf{w},d)\right)\right\rangle_{\mathbf{w}\in\mathcal{S}(N_{\text{fix}})}\right],$$

$$c_{ij}=\langle w_{i}w_{j}\rangle-\langle w_{i}\rangle\langle w_{j}\rangle,$$

$$\overline{XCorr^{2}}=\frac{1}{N_{\text{unfix}}(N_{\text{unfix}}-1)}\left[\sum_{i\neq j}c_{ij}^{2}\right]$$

$$\overline{XCorr}=\frac{1}{N_{\text{unfix}}(N_{\text{unfix}}-1)}\left[\sum_{i\neq j}c_{ij}\right]$$

$$\overline{XCorr^{2}}=\frac{1}{N_{\text{unfix}}^{2}}\left[\sum_{i,j}c_{ij}^{2}\right]=\frac{1}{3}x_{*}(1-x_{*})(q_{\text{same}}-q_{\text{diff}})^{2},$$

$$x_{*}=1-\left[\sum_{\gamma}\left(\frac{N_{\gamma}}{N_{\text{solution}}}\right)^{2}\right],$$

$$k(N_{\text{fix}})=\frac{1}{N}\left[\frac{\sum_{\mathbf{w}\in\mathcal{S}(N_{\text{fix}})}\log\left(N_{\gamma\subset\mathcal{S}(N_{\text{fix}}=0)}(\mathbf{w})\right)}{\sum_{\mathbf{w}\in\mathcal{S}(N_{\text{fix}})}1}\right],$$

$$h_{i\to\mu}^{t+1}=f\left[p(t)\,h_{i}^{t}+\sum_{\nu\in\partial i\setminus\mu}\hat{h}_{\nu\to i}^{t}\right]+(1-f)\,h_{i\to\mu}^{t},\qquad\hat{h}_{\mu\to i}=\log\frac{H\left(s^{\mu}\frac{\theta-1-a_{\mu\to i}}{\sigma_{\mu\to i}}\right)}{H\left(s^{\mu}\frac{\theta-a_{\mu\to i}}{\sigma_{\mu\to i}}\right)},$$

$$s^{\mu}=2\sigma^{\mu}-1,\qquad\theta=N/A$$

$$a_{\mu\to i}=\sum_{j\neq i}\frac{1}{1+e^{-h_{j\to\mu}}},\qquad\sigma_{\mu\to i}^{2}=\sum_{j\neq i}\frac{e^{-h_{j\to\mu}}}{\left(1+e^{-h_{j\to\mu}}\right)^{2}}.$$

$$h_{i}^{t+1}=p(t)\,h_{i}^{t}+\sum_{\nu\in\partial i}\hat{h}_{\nu\to i}^{t},$$

$$p_{i\to\mu}(w_{i})\propto\prod_{\nu\in\partial i\setminus\mu}\hat{p}_{\nu\to i}(w_{i})$$

$$\hat{p}_{\mu\to i}(w_{i})\propto\sum_{\{w_{j}\}_{j\in\partial\mu\setminus i}}\Theta\left(s^{\mu}\left(w_{i}+\sum_{j\in\partial\mu\setminus i}w_{j}+N_{\partial\mu}^{\text{fix}}-\theta\right)\right)\prod_{j\in\partial\mu\setminus i}p_{j\to\mu}(w_{j}),$$

$$p_{i}(w_{i})\propto\prod_{\mu\in\partial i}\hat{p}_{\mu\to i}(w_{i}).$$

$$h_{i\to\mu}=\sum_{\nu\in\partial i\setminus\mu}\hat{h}_{\nu\to i}$$

$$\hat{h}_{\mu\to i}=\log\frac{\sum_{\{w_{j}\}_{j\in\partial\mu\setminus i}}\Theta\left(s^{\mu}\left(1+\sum_{j\in\partial\mu\setminus i}w_{j}+N_{\partial\mu}^{\text{fix}}-\theta\right)\right)\prod_{j\in\partial\mu\setminus i}p_{j\to\mu}(w_{j})}{\sum_{\{w_{j}\}_{j\in\partial\mu\setminus i}}\Theta\left(s^{\mu}\left(\sum_{j\in\partial\mu\setminus i}w_{j}+N_{\partial\mu}^{\text{fix}}-\theta\right)\right)\prod_{j\in\partial\mu\setminus i}p_{j\to\mu}(w_{j})}.$$

$$\hat{h}_{\mu\to i}=\log\frac{H\left(s^{\mu}\frac{\theta-1-N_{\partial\mu}^{\text{fix}}-a_{\mu\to i}}{\sigma_{\mu\to i}}\right)}{H\left(s^{\mu}\frac{\theta-N_{\partial\mu}^{\text{fix}}-a_{\mu\to i}}{\sigma_{\mu\to i}}\right)},$$

$$h_{i\to\mu}^{t+1}=f\sum_{\nu\in\partial i\setminus\mu}\hat{h}_{\nu\to i}^{t}+(1-f)\,h_{i\to\mu}^{t}.$$

$$c_{ij}=\langle w_{i}w_{j}\rangle-\langle w_{i}\rangle\langle w_{j}\rangle=p(w_{i}=1)\,p(w_{j}=1\mid w_{i}=1)-p(w_{i}=1)\,p(w_{j}=1),$$

$$\overline{XCorr}=\left[\frac{1}{|\mathcal{A}|(N_{\text{unfixed}}-1)}\sum_{i\in\mathcal{A}}\sum_{j\neq i}c_{ij}\right],\qquad\overline{XCorr^{2}}=\left[\frac{1}{|\mathcal{A}|(N_{\text{unfixed}}-1)}\sum_{i\in\mathcal{A}}\sum_{j\neq i}c_{ij}^{2}\right],$$

$$Z=\sum_{\mathbf{w}}\prod_{\mu}\Theta\left(s^{\mu}\left(\sum_{j}w_{j}\xi_{j}^{\mu}-N/A\right)\right)\prod_{i}\exp\left(x\,(w_{i}-\tilde{w}_{i})^{2}\right),$$

$$Z(\xi,s)=\sum_{\mathbf{w}}X_{\xi,s}(\mathbf{w}),$$

$$X_{\xi,s}(\mathbf{w})=\prod_{\mu=1}^{\alpha N}\Theta\left[s^{\mu}\left(\sum_{i=1}^{N}w_{i}\xi_{i}^{\mu}-N/A\right)\right].$$

$$F=\frac{1}{N}\left\langle\log Z(\xi,s)\right\rangle_{\xi,s},$$

$$F_{\text{FP}}(D)=\frac{1}{N}\left\langle\frac{\sum_{\tilde{\mathbf{w}}}X_{\xi,s}(\tilde{\mathbf{w}})\ln\left(\mathcal{N}_{\xi,s}(\tilde{\mathbf{w}},D)\right)}{\sum_{\tilde{\mathbf{w}}}X_{\xi,s}(\tilde{\mathbf{w}})}\right\rangle_{\xi,s},$$

$$F_{\text{LD}}(y,D)=\frac{1}{Ny}\left\langle\ln\left(\sum_{\tilde{\mathbf{w}}}\mathcal{N}(\tilde{\mathbf{w}},D)^{y}\right)\right\rangle_{\xi,s}.$$

$\text{Zedong Bi}^{1,2}$ and $\text{Changsong Zhou}^{1,2,3,*}$

Abstract

Limited precision of synaptic weights is a key aspect of both biological and hardware implementations of neural networks. Assigning low-precision weights during learning is a non-trivial task, but may benefit from representing to-be-learned items using a sparse code. However, the computational difficulty resulting from low weight precision and the advantage of sparse coding are not yet fully understood. Here, we study a perceptron model, which associates binary (0 or 1) input patterns with desired outputs using binary (0 or 1) weights, modeling a single neuron receiving excitatory inputs. Two efficient perceptron solvers (SBPI and rBP) usually find solutions in the dense solution region. We studied this dense-region-approaching process through a decimation algorithm, where at every time step, marginal probabilities of unfixed weights were evaluated, then the most polarized weight was fixed at its preferred value. We compared the decimation-fixing order of weights with the dynamics of SBPI and rBP, and studied the structure of the solution subspace $\mathcal{S}$ where early-decimation-fixed weights take their fixed values. We found that in SBPI and rBP, most time steps are spent on determining the values of late-decimation-fixed weights. This algorithmic difficulty may result from strong cross-correlation between late-decimation-fixed weights in $\mathcal{S}$ , and is related to solution condensation in $\mathcal{S}$ during decimation. Input sparseness reduces the time steps that SBPI and rBP need to find solutions, due to a reduction of cross-correlation between late-decimation-fixed weights. When the constraint density increases across a transition point, the portion of weights whose values are difficult to determine sharply increases, signaling that the dense solution region becomes difficult to access. We propose that the computational difficulty of the binary-weight perceptron is related to a solution condensation process that occurs while approaching the dense solution region. Our work highlights the heterogeneity of the learning dynamics of weights, which may help understand axonal pruning in brain development, and inspire more efficient algorithms to train neural networks.

1 Department of Physics, Centre for Nonlinear Studies and Institute of Computational and Theoretical Studies, Hong Kong Baptist University, Kowloon Tong, Hong Kong

2 Research Centre, HKBU Institute of Research and Continuing Education, Shenzhen, China

3 Beijing Computational Science Research Center, Beijing, China


1 Introduction

Information in neural networks is stored in synaptic weights. An important feature of synaptic weights in nature and in simple hardware implementations is low precision [1, 2, 3]. However, assigning suitable values to low-precision weights is a non-trivial task, and is known to be intractable in the worst case [4, 5]. Artificial neural networks with low-precision weights require more epochs of training before performance converges [6, 7]. Therefore, it is of great interest to understand where the hardness comes from and how to avoid it or use it.

The perceptron is a one-layer network that receives inputs from many neurons and gives a single output. It serves as an elementary building block of complex neural networks, and is also one of the basic structures for learning and memory [8]. It is also a powerful computational unit by itself, with applications in, for example, rule inference [9] and data compression [10]. Here we try to understand the computational difficulty induced by low weight precision by studying a perceptron whose weights take binary values (0 or 1). We consider a classification task: given a set of random input patterns, we want to adjust the synaptic weights so that the perceptron gives a desired output in response to each input pattern. A synaptic weight configuration that successfully associates all these input-output pairs is a solution of the perceptron.

Methods of statistical physics imply that most solutions of the binary-weight perceptron are isolated and computationally hard to find [11, 12]. The seminal work of Ref. [13] showed that the weight configuration space of the binary-weight perceptron has a region where solutions densely aggregate. It is believed that solutions found by algorithms usually belong to this dense solution region [13, 14], and efficient algorithms have been designed by guiding weight configurations toward the dense solution region [15, 16, 17]. This implies that the computational difficulty in solving the binary-weight perceptron, if any, should emerge during the process of approaching the dense solution region. This difficulty should be related to the geometric structure of solution space around the dense solution region, and has little to do with the isolated solutions. However, there is a knowledge gap about how the structure of solution space around the dense region influences algorithmic difficulty.

The “difficulty” encountered by algorithms may have two different meanings, depending on the constraint density $\alpha$ ( $\alpha=M/N$ , with $M$ being the number of input-output pairs to be associated and $N$ being the number of synapses). The first is the difficult part during efficient solving. When $\alpha$ takes small values, there exist algorithms (such as SBPI and rBP) that can solve the problem efficiently (i.e., typically in a number of time steps that is sub-exponential in $N$ ) [18, 19]. In this case, it is interesting to study which part of the solving process consumes most of the time, and to think of methods to reduce the solving time. The second is exponential-time difficulty. When $\alpha$ takes a large value, close to the theoretical capacity $\alpha_{c}^{\text{theo}}$ (the maximal $\alpha$ at which the perceptron theoretically has solutions), there are no known efficient solvers, so one has to wait an exponentially long time to find solutions that theoretically exist. In this case, it is interesting to study what prevents the existence of efficient solvers. As far as we know, no study has addressed the difficult part during the efficient solving process. It has been suggested that exponential-time difficulty is related to fragmentation of the dense solution region at $\alpha_{U}$ when $\alpha\rightarrow\alpha_{c}^{\text{theo}}$ [20]. However, it is unknown how this fragmentation influences the dynamics of algorithms to prevent them from finding solutions.

A possible way to reduce the computational difficulty of the binary-weight perceptron is to represent the to-be-classified items using a sparse code in the input patterns. The functional pros and cons of input sparseness have been discussed in many contexts. On the one hand, input sparseness allows algorithms to store more input-output (IO) associations in a perceptron [21, 22, 23], and facilitates classification of input patterns [24]; on the other hand, input sparseness also reduces the information content of input patterns and reduces the generalization performance of the trained network [25, 26]. Despite these possible cons, biological neural analogues of the perceptron, such as cerebellar Purkinje cells and insect mushroom body output neurons, receive inputs from a large number of cerebellar granule cells and Kenyon cells [27, 28], which have low firing probability. Theoretically, the advantage of input sparseness for memory storage is mainly understood by calculating the theoretical capacity $\alpha_{c}^{\text{theo}}$ . The result is that input sparseness increases $\alpha_{c}^{\text{theo}}$ [27, 29]. A point seldom discussed is how input sparsity influences the number of time steps that an algorithm uses to find a solution, in the case where the algorithm can solve the perceptron “efficiently” (i.e., in a number of time steps that is sub-exponential in $N$ ). If sparser input required more time steps to find a solution in the regime where solutions can be found efficiently, then the biological and engineering interest of input sparseness would be significantly weakened. In this paper, we study this point by investigating how input sparsity influences the difficult part of efficient solving that consumes most time steps.

One possible way to understand the computational difficulty of the binary-weight perceptron is through the cross-correlation of weights in solution space. Strong cross-correlation breaks the Bethe-Peierls approximation [30], which supposes that the probability distribution of the value of one weight in solution space is independent of the values of the other weights. This approximation lies at the heart of belief propagation [30], which is an ingredient of, or closely related to, a number of efficient algorithms [19, 15, 18, 31]. From another angle, strong cross-correlation implies that the value preferences of a large number of weights are reconfigured in response to the flipping of a single weight. This means that in a local search algorithm, many subsequent rearrangements may be needed following the modification of a single weight, which increases the difficulty of finding solutions [32]. Additionally, cross-correlation may be related to the condensation of solutions [30, 33], which means that most solutions aggregate into a small number of clusters.

In this paper, we study a perceptron model in which both inputs and weights take binary values (0 or 1), modeling a single neuron receiving excitatory inputs. To address the knowledge gap between the structure of solution space around the dense solution region and algorithmic difficulty, we propose to study a decimation process, in which weights are successively fixed. At each time step, marginal probabilities of unfixed weights are computed using belief propagation or, in small systems, enumerated solutions, then the most polarized weight is fixed at its preferred value. We will show that, like two efficient algorithms (SBPI and rBP), decimation is also a process approaching the dense solution region, and that the order of weight-fixing in decimation is comparable to the dynamics of these two efficient algorithms. Additionally, decimation is a simpler process than SBPI and rBP, because a weight does not change its value again once it is fixed, unlike in SBPI and rBP, where the preferences of weights change continuously. Therefore, decimation is an ideal surrogate for SBPI and rBP in theoretical study. We will study the cross-correlation between weights and the geometry of solution clusters in the solution subspace made up of unfixed weights during decimation, thereby gaining insight into how efficient algorithms reshape solution space while approaching the dense solution region. With the help of the decimation process, we hope to shed light on the links among solving dynamics in efficient algorithms, algorithmic difficulty, and the structure of solution space around the dense region.

This paper is organized as follows. After describing the model in Section 2, we investigate in Section 3 the capacity of the model and the time steps that SBPI and rBP need to obtain solutions at different input sparsities. In Sections 4-7, we study the difficult part of solving and the advantage of input sparseness with the help of decimation. In Sections 8 and 9, we study exponential-time difficulty by investigating the case when $\alpha\rightarrow\alpha_{c}^{\text{theo}}$ .

2 Model

We consider a perceptron model (Fig.1a) in which a neuron receives $N$ binary inputs $\xi_{i}\in\{0,1\}$ ( $i=1,2,\cdots,N$ ), which indicate whether the $i$ th input neuron fires or not. The synaptic weights are also binary, $w_{i}\in\{0,1\}$ , modeling excitatory synapses. The output of the perceptron is $\tau(\mathbf{w},\mathbf{\xi})=\Theta(\mathbf{w}\cdot\mathbf{\xi}-N/A)$ , which takes 1 or 0 depending on whether its total synaptic input is larger than the firing threshold $N/A$ , where $A>0$ is an adjustable parameter. This neuron is then provided with $\alpha N$ input patterns $\mathbf{\xi}^{\mu}$ and desired outputs $\sigma^{\mu}$ , with $\mu=1,2,\cdots,\alpha N$ . The task is to adjust the synaptic weights $\mathbf{w}$ so that $\tau(\mathbf{w},\mathbf{\xi}^{\mu})=\sigma^{\mu}$ for all $\mu$ . The probability that $\xi_{i}$ takes the value 1 is the input coding level $f_{in}$ . Input becomes sparser with smaller $f_{in}$ . In our model, $f_{in}$ is fixed at a nonzero value when $N\rightarrow\infty$ , so that the number of $\xi_{i}$ that take the value 1 is of order $\mathcal{O}(N)$ . $\sigma^{\mu}$ has equal probability to be 0 or 1.
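As a concrete illustration of the model just described, the following Python sketch samples a random task and evaluates the perceptron output. The function names `sample_task` and `perceptron_output`, the use of NumPy, and the seed are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def sample_task(N, alpha, f_in, seed=0):
    """Sample alpha*N random binary patterns and desired outputs (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    M = int(round(alpha * N))
    # Each input bit is 1 with probability f_in (the input coding level).
    xi = (rng.random((M, N)) < f_in).astype(int)
    # Desired outputs are 0 or 1 with equal probability.
    sigma = rng.integers(0, 2, size=M)
    return xi, sigma


def perceptron_output(w, xi, A):
    """Return Theta(w . xi - N/A): 1 where the total input exceeds the threshold N/A."""
    N = xi.shape[-1]
    return (xi @ w > N / A).astype(int)
```

A weight vector `w` is a solution of the sampled task when `perceptron_output(w, xi, A)` equals `sigma` for every pattern.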

We now discuss the meaning of $A$ . For a well-trained perceptron, it can be shown that the total synaptic current averaged over the $\alpha N$ patterns, $\langle\mathbf{w}\cdot\mathbf{\xi}^{\mu}\rangle_{\mu}$ , is equal to the firing threshold $N/A$ up to order $\mathcal{O}(\sqrt{N})$ (see Appendix E). As a larger synaptic current consumes more energy, an energetically efficient perceptron should reduce the firing threshold by enlarging $A$ (Fig.1b, left). Of course, if the firing threshold is too low, the neuron may have high spontaneous activity due to membrane or synaptic noise [34], potentially consuming more energy. However, the advantage of enlarging $A$ should be apparent as long as such spontaneous activity is rare.

The meaning of $A$ can also be interpreted in another way by noting that the neuron output is $\sigma=\Theta(\mathbf{w}\cdot\mathbf{\xi}-N/A)=\Theta(A\mathbf{w}\cdot\mathbf{\xi}-N)$ , where in the latter expression $A$ rescales the discrete values that weights can take. This weight rescaling can increase the success rate of memory retrieval under noise (Fig.1b, right). Intuitively, because the $\xi_{i}^{\mu}$ are assumed to be independent of each other, the standard deviation of the distribution of total synaptic current $I^{\mu}=A\mathbf{w}\cdot\mathbf{\xi}^{\mu}$ over $\mu$ is approximately $\sqrt{f_{in}(1-f_{in})\sum_{i}(Aw_{i})^{2}}\approx\sqrt{AN(1-f_{in})}$ (Appendix E), which increases with $A$ ; and the input patterns with $I^{\mu}$ larger (or smaller) than the firing threshold $N$ are associated with output 1 (or 0). Under noise with strength $\sigma_{noise}$ , the perceptron may not give correct outputs in response to input patterns whose $I^{\mu}$ lie within the range $[N-\sigma_{noise},N+\sigma_{noise}]$ (red region in Fig.1b, right). However, when $A$ gets larger, the distribution of $I^{\mu}$ becomes broader (gray curve in Fig.1b, right), so that the portion of patterns whose $I^{\mu}$ lie within this range decreases, which means that the perceptron is more likely to output the associated value in response to a stored input pattern under noise. We provide numerical evidence to support this intuition in Supplemental Material Section 1 and Supplementary Fig.1.
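The approximation for the standard deviation of $I^{\mu}$ can be sketched in three steps. This is only a leading-order estimate, assuming that the $\xi_{i}^{\mu}$ are independent with $P(\xi_{i}^{\mu}=1)=f_{in}$ and that the mean synaptic current of a well-trained perceptron matches the threshold $N/A$ as stated above; the paper refers to Appendix E for the details.

```latex
\begin{align}
\mathrm{Var}_{\mu}\left(I^{\mu}\right)
  &= f_{in}(1-f_{in})\sum_{i}(A w_{i})^{2}
   = A^{2} f_{in}(1-f_{in})\sum_{i} w_{i}
   && \text{(since } w_{i}^{2}=w_{i}\text{ for } w_{i}\in\{0,1\}\text{)} \\
\langle \mathbf{w}\cdot\mathbf{\xi}^{\mu}\rangle_{\mu}
  &= f_{in}\sum_{i} w_{i} \approx \frac{N}{A}
   \;\Rightarrow\; \sum_{i} w_{i} \approx \frac{N}{A f_{in}} \\
\mathrm{Var}_{\mu}\left(I^{\mu}\right)
  &\approx A^{2} f_{in}(1-f_{in})\,\frac{N}{A f_{in}} = A N (1-f_{in})
\end{align}
```

Taking the square root gives $\sqrt{AN(1-f_{in})}$ , which grows with $A$ at fixed $N$ and $f_{in}$ .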

In a large body of literature (e.g., Refs. [35, 36]), retrieval robustness is realized by requiring that the total synaptic input is above (below) the firing threshold by a stability parameter $\kappa>0$ when the desired output is active (inactive). Here we effectively set $\kappa=0$ . Introducing $\kappa>0$ ensures that memory retrieval is always successful when the noise strength $\sigma_{noise}<\kappa$ , whereas enlarging $A$ increases the probability of successful retrieval when $\sigma_{noise}>\kappa$ .

Together, in our model, the requirements for low energy cost and high robustness for memory retrieval can be unified in a single motivation: enlarging $A$ .

3 Computational advantage of sparse input under large $A$

We evaluated the capability of our perceptron model in the classification task by investigating two efficient algorithmic solvers: the stochastic belief-propagation-inspired algorithm (SBPI) [18] and reinforced belief propagation (rBP) [19]. Both algorithms assign a hidden state $h_{i}$ to weight $w_{i}$ , update $h_{i}$ during solving, and take $w_{i}=0$ (or 1) when $h_{i}$ is negative (or positive); see Appendix A for more details of their implementations:

(1) SBPI is a biologically plausible on-line algorithm, in which an input-output (IO) pair is presented at each time step. If the presented IO pair is unassociated (i.e., when $\sum_{i}w_{i}\xi_{i}^{\mu}>N/A$ but $\sigma^{\mu}=0$ or when $\sum_{i}w_{i}\xi_{i}^{\mu}<N/A$ but $\sigma^{\mu}=1$ ), then hidden state $h_{i}$ is updated as $h_{i}\leftarrow h_{i}+2\xi_{i}^{\mu}(2\sigma^{\mu}-1)$ , so that $w_{i}$ may be flipped towards the direction that facilitates the association of the presented IO pair. If $\sigma^{\mu}=0$ and the presented IO pair is associated, but the synaptic current $\sum_{i}w_{i}\xi_{i}^{\mu}$ is too close to the firing threshold $N/A$ , or in other words, the presented IO pair is just on the edge of association, then $h_{i}$ is also updated, with some probability, as $h_{i}\leftarrow h_{i}-2\xi_{i}^{\mu}$ , to increase the margin between $\sum_{i}w_{i}\xi_{i}^{\mu}$ and $N/A$ .

(2) rBP is a message-passing algorithm, which uses belief propagation (BP) to update $h_{i}$ . Specifically,

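$$h_{i\to\mu}^{t+1}=f\left[p(t)\,h_{i}^{t}+\phi_{i\to\mu}\left(\bigcup_{\nu\in\partial i\setminus\mu}\hat{h}_{\nu\to i}^{t}\right)\right]+(1-f)\,h_{i\to\mu}^{t},\qquad(1)$$

$$\hat{h}_{\mu\to i}^{t+1}=\hat{\phi}_{\mu\to i}\left(\bigcup_{j\in\partial\mu\setminus i}h_{j\to\mu}^{t}\right),\qquad(2)$$

$$h_{i}^{t+1}=p(t)\,h_{i}^{t}+\phi_{i}\left(\bigcup_{\nu\in\partial i}\hat{h}_{\nu\to i}^{t}\right).\qquad(3)$$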

In the equations above, $\phi_{i\rightarrow\mu}(\cdot)$ and $\hat{\phi}_{\mu\rightarrow i}(\cdot)$ are BP update functions, $\phi_{i}(\cdot)$ is the BP estimate of the single-site field, $p(t)$ is a random number that takes the value $1$ with probability $1-\gamma^{t}$ and 0 otherwise, and $f$ is a damping factor. Eqs. 1-3 use BP to evaluate the single-site field $h_{i}$ , and then, in the next time step, add $h_{i}$ as an external field to BP with probability $1-\gamma^{t}$ .
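The SBPI update rule described in item (1) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation (see Appendix A of the paper for that): the function names, the odd-integer initialization of the hidden states (chosen so that steps of size 2 never land on zero), the `margin` that defines "too close to threshold", the probability `p_soft`, and the sweep-based stopping rule are all assumptions.

```python
import numpy as np


def is_solution(w, xi, sigma, theta):
    """True if Theta(w . xi^mu - theta) equals sigma^mu for every pattern."""
    outputs = (xi @ w > theta).astype(int)
    return bool(np.all(outputs == sigma))


def sbpi_sketch(xi, sigma, A, margin=1.0, p_soft=0.3, max_sweeps=200, seed=0):
    """Hedged sketch of an SBPI-style on-line update; parameters are hypothetical."""
    rng = np.random.default_rng(seed)
    xi = np.asarray(xi, dtype=np.int64)
    sigma = np.asarray(sigma, dtype=np.int64)
    M, N = xi.shape
    theta = N / A
    # Assumption: odd integer hidden states in [-5, 5], so h is never exactly 0.
    h = 2 * rng.integers(-3, 3, size=N) + 1
    for sweep in range(1, max_sweeps + 1):
        for mu in rng.permutation(M):
            w = (h > 0).astype(np.int64)
            current = int(w @ xi[mu])
            output = int(current > theta)
            if output != sigma[mu]:
                # Unassociated pair: push active inputs toward the desired output.
                h += 2 * xi[mu] * (2 * sigma[mu] - 1)
            elif sigma[mu] == 0 and theta - current < margin and rng.random() < p_soft:
                # Associated but barely below threshold: widen the margin.
                h -= 2 * xi[mu]
        w = (h > 0).astype(np.int64)
        if is_solution(w, xi, sigma, theta):
            return w, sweep
    return None, max_sweeps
```

The function returns a weight vector together with the number of sweeps used, or `None` if no solution is found within `max_sweeps` passes over the patterns.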

We first studied the dependence of the algorithmic capacity $\alpha_{c}^{\text{alg}}$ of SBPI and rBP on $f_{in}$ and $A$ . Here $\alpha_{c}^{\text{alg}}N$ is the maximal number of IO pairs that the perceptron can associate using an algorithm. We found that at a given $f_{in}$ , $\alpha_{c}^{\text{alg}}$ increases with $A$ when $A$ is smaller than an optimal value $A_{opt}$ , and decreases with $A$ when $A>A_{opt}$ (Fig.2a-c). So this $A_{opt}$ , which decreases with $f_{in}$ (Fig.2a-c), divides the dependence of $\alpha_{c}^{\text{alg}}$ on $A$ into two branches: a low- $A$ branch and a high- $A$ branch. Because $\alpha_{c}^{\text{alg}}$ is non-monotonic in $A$ , a given $\alpha_{c}^{\text{alg}}$ usually corresponds to two $A$ values at a given $f_{in}$ (Fig.2a-c). However, a good biological or engineering design should work at the larger $A$ value, because it implies lower energy cost or higher retrieval robustness without compromising memory capacity. So only the high- $A$ branch is of biological and engineering interest, and only this branch will be considered in the following discussion. At a given $A$ , the dependence of $\alpha_{c}^{\text{alg}}$ on $f_{in}$ may not be monotonic, but on the high- $A$ branch, $\alpha_{c}^{\text{alg}}$ increases with input sparseness (Fig.2a-c). This means that under the requirements of low energy consumption and robust memory retrieval, input sparseness increases memory capacity. From another angle, the decrease of $\alpha_{c}^{\text{alg}}$ with $A$ on the high- $A$ branch reflects a trade-off between capacity and energy cost or retrieval robustness, but this trade-off can be alleviated by increasing input sparseness: decreasing $f_{in}$ enables the same $\alpha_{c}^{\text{alg}}$ to be achieved using a larger $A$ (Fig.2a-c).

To understand the change of $\alpha_{c}^{\text{alg}}$ with $A$ and $f_{in}$ , we first calculated the theoretical capacity $\alpha_{c}^{\text{theo}}$ using the replica method (Appendix E). This quantity corresponds to the maximal number of IO pairs for which the perceptron has a solution, and provides an upper bound on $\alpha_{c}^{\text{alg}}$ . We found that the change of $\alpha_{c}^{\text{theo}}$ with $f_{in}$ and $A$ follows a similar profile to that of $\alpha_{c}^{\text{alg}}$ (Fig.2a,c): the dependence of $\alpha_{c}^{\text{theo}}$ on $A$ also has two branches; and on the high- $A$ branch, $\alpha_{c}^{\text{theo}}$ increases with input sparseness (i.e., with decreasing $f_{in}$ ). We discuss the non-monotonicity of capacity with $A$ and $f_{in}$ in Supplemental Material Section 2 and Supplementary Fig.2, and we find that the low- $A$ branch exists because the weights are bounded from above (i.e., $w_{i}\leq 1$ ) in our model.

However, calculating $\alpha_{c}^{\text{theo}}$ alone is not sufficient to understand the computational advantage of input sparseness, because it is possible that algorithms need more time steps to find a solution even though there are more solutions in the weight configuration space. How easy it is to find solutions also matters: on the one hand, algorithms usually have an upper limit on the number of time steps, $T_{max}$ , and only the cases in which solutions are found within $T_{max}$ time steps count as solving successes and contribute to $\alpha_{c}^{\text{alg}}$ ; on the other hand, solving the perceptron in fewer time steps, which implies quick learning, is itself of biological and engineering interest.

We investigated the number of time steps $T_{solve}$ that SBPI and rBP need to solve the problem under different input sparsities at a given $\alpha$ . Under SBPI, the change of $T_{solve}$ with $f_{in}$ and $A$ is largely anti-correlated with the change of $\alpha_{c}^{\text{alg}}$ : the dependence of $T_{solve}$ on $A$ also has two branches, and on the high- $A$ branch, $T_{solve}$ decreases with input sparseness (Fig.2d). Under rBP, however, the change of $T_{solve}$ with $f_{in}$ is more subtle and depends on the parameter $\gamma$ of rBP. On the high- $A$ branch of $\alpha_{c}^{\text{alg}}$ , at a fixed $\gamma$ , $T_{solve}$ may not decrease with input sparseness (Supplementary Fig.3b); but if we also tune $\gamma$ , defining $T_{solve}$ as the minimal number of time steps in which rBP solves successfully across different values of $\gamma$ , then $T_{solve}$ follows a similar profile to that in SBPI (Fig.2e): $T_{solve}$ has two branches with $A$ , and decreases with input sparseness on the high- $A$ branch. This means that we are, in principle, able to design a new algorithm based on rBP that cleverly chooses $\gamma$ for quick solving based on, say, experience; such a clever-rBP algorithm requires fewer time steps under sparser input on the high- $A$ branch. See Supplemental Material Section 3 and Supplementary Fig.3 for more discussion of the rBP case.

Together, at large $A$ , the computational advantage of input sparseness is twofold: (1) sparser input results in higher theoretical capacity, which means that more IO pairs can potentially be associated with sparser input; (2) at a given $\alpha$ , solutions can be obtained in fewer time steps under sparser input, which turns the theoretical potential of input sparseness suggested by the first point into algorithmic reality. The first point can be seen from the results of the replica method, and is consistent with the findings of previous studies [27, 29]. In Sections 4-7, with the help of the decimation process, we study which part of the solving process consumes most time steps and why input sparseness reduces the time steps needed to find solutions.

4 Understanding the difficult part during perceptron solving through decimation process

To understand the difficult part in solving the binary-weight perceptron and the advantage of sparse input, we studied the decimation process, in which a single weight is fixed in each time step. Specifically, at the first time step, we evaluated the marginal probability $p_{i}$ that the $i$ th weight takes the value 1 in solutions. Then we chose the $j$ th weight that had the strongest value polarization (i.e., $j=\arg\min_{i}\min\{p_{i},1-p_{i}\}$ ), fixed it at its preferred value $a_{j}^{\text{prefer}}$ (i.e., $a_{j}^{\text{prefer}}=0$ if $p_{j}<0.5$ and 1 otherwise), and eliminated the solutions with $w_{j}\neq a_{j}^{\text{prefer}}$ from consideration in further time steps. At the $k$ th time step, we evaluated $p_{i}$ of the unfixed weights in the remaining solutions, then fixed the most polarized unfixed weight and eliminated solutions in the same way as above. By iterating this procedure, we could hopefully fix all the weights and obtain a solution of the perceptron problem. The word "decimation" means removing a variable node from the factor graph [37].

When the marginal probabilities $p_{i}$ in each time step are computed using belief propagation, the successive weight-fixing process above has the name belief-propagation-guided decimation (BPD). In theoretical interest, in small-sized systems, we can get the exact value of $p_{i}$ using enumerated solutions, which is called exact decimation.
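The exact-decimation procedure described above can be sketched as follows. This is a minimal illustration, assuming the solution set has already been enumerated and is given as a list of 0/1 tuples; the function name `exact_decimation`, the toy solution set, and the tie-breaking rule (the lowest index wins) are illustrative assumptions rather than details of our implementation.

```python
def exact_decimation(solutions):
    """Fix one weight per step using exact marginals from enumerated solutions.

    solutions: non-empty list of equal-length tuples of 0/1 weights.
    Returns (order, fixed): the weight indices in fixing order, and a dict
    mapping each index to the value it was fixed at.
    """
    remaining = list(solutions)
    n = len(remaining[0])
    unfixed = set(range(n))
    order, fixed = [], {}
    while unfixed:
        # Marginal probability p_i that weight i equals 1 in the remaining solutions.
        p = {i: sum(w[i] for w in remaining) / len(remaining) for i in unfixed}
        # Most polarized weight: smallest min(p_i, 1 - p_i).
        # Assumption: ties are broken in favor of the lowest index.
        j = min(sorted(unfixed), key=lambda i: min(p[i], 1 - p[i]))
        value = 0 if p[j] < 0.5 else 1
        # Eliminate the solutions that disagree with the fixed value.
        remaining = [w for w in remaining if w[j] == value]
        unfixed.remove(j)
        order.append(j)
        fixed[j] = value
    return order, fixed


# Hypothetical toy solution set (not generated from a real perceptron instance).
toy = [(1, 0, 1), (1, 1, 0), (1, 0, 0), (0, 0, 1)]
order, fixed = exact_decimation(toy)
final = tuple(fixed[i] for i in range(len(toy[0])))
```

Because each weight is fixed at the value taken by at least half of the remaining solutions, the remaining set never becomes empty, so `final` is always one of the enumerated solutions. BPD follows the same loop, with the marginals estimated by belief propagation instead of being read off an enumerated set.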

We studied the decimation process because BPD is built on an idea similar to that behind rBP and SBPI: BPD is equivalent to adding an external field of infinite intensity to the fixed weights, while rBP is a sort of smooth decimation in which each variable gets an external field with intensity proportional to its polarization [19]; and SBPI is an on-line algorithm derived heuristically from rBP [18]. So the dynamic processes of these three algorithms should be comparable in some sense. Additionally, in the decimation process, weight values are not changed again once fixed, so this process can be studied in a more controlled manner, unlike rBP and SBPI, where the preferences of weights change continuously.

We compared the dynamics of SBPI and rBP with the weight-fixing order in BPD. We found that for weights fixed early in BPD, the hidden states in SBPI or rBP quickly deviate from zero in the first few time steps, and hardly change their signs in the subsequent solving process; however, for weights fixed late in BPD, the hidden states stay close to zero and are prone to change their signs in the subsequent solving process (Fig.3a). We then defined the fixing time $t_{i}^{\text{fix}}$ of weight $w_{i}$ as the last time step at which the hidden state $h_{i}$ changes its sign during SBPI or rBP. We found $t_{i}^{\text{fix}}\approx 0$ for early-BPD-fixed weights, while $t_{i}^{\text{fix}}$ is large for late-BPD-fixed weights, and larger under denser input (Fig.3b,c). This means that SBPI and rBP can assign values to early-BPD-fixed weights in the first few time steps, while most time steps are spent on determining the values of late-BPD-fixed weights; and spending less time on these late-BPD-fixed weights is the key reason for faster solving under sparser input.
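The fixing time defined above can be read off a recorded trajectory of hidden states. The sketch below assumes the trajectory is stored as a list of per-time-step lists; the name `fixing_times` and the convention that an exact zero does not count as a sign change are illustrative assumptions.

```python
def fixing_times(hidden_history):
    """Last time step at which each hidden state changes sign.

    hidden_history[t][i] is the hidden state h_i at time step t.
    Returns t_fix, with t_fix[i] = 0 if h_i never changes sign.
    """
    n = len(hidden_history[0])
    t_fix = [0] * n
    for t in range(1, len(hidden_history)):
        for i in range(n):
            # A strict sign change makes the product of consecutive values negative.
            # Assumption: a value of exactly zero is not treated as a sign change.
            if hidden_history[t - 1][i] * hidden_history[t][i] < 0:
                t_fix[i] = t
    return t_fix


# Hypothetical trajectory: weight 0 settles immediately, weight 1 keeps flipping.
history = [[1, -1], [3, 1], [5, -1], [7, -3]]
t_fix = fixing_times(history)
```

In this toy trajectory the first hidden state never changes sign, so its fixing time is zero, while the second changes sign for the last time at step 2.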

We also calculated the difference of $w_{i}$ in solutions found by SBPI or rBP from that found by BPD:

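$$\Delta_{i,\text{BPD}}=\left[\left\langle\left(w_{i}^{\mathbf{x}}-w_{i}^{\text{BPD}}\right)^{2}\right\rangle_{\mathbf{x}}\right],$$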

where $w_{i}^{\text{BPD}}$ is the value of $w_{i}$ in the solution found by BPD, $w_{i}^{\mathbf{x}}$ is the value of $w_{i}$ in a solution $\mathbf{x}$ found by SBPI or rBP, $\langle\cdot\rangle_{\mathbf{x}}$ means averaging over the solutions found by SBPI or rBP under different seeds of random number generator, and $[\cdot]$ means quenched average (i.e., average over different sets of IO pairs). We found that $\Delta_{i,\text{BPD}}\approx 0$ for early-BPD-fixed weights (Fig.3d,e), indicating that $w_{i}^{\mathbf{x}}=w_{i}^{\text{BPD}}$ in most cases; while $\Delta_{i,\text{BPD}}\approx 0.5$ for late-BPD-fixed weights (Fig.3d,e), just like the case when $w_{i}^{\mathbf{x}}$ takes 0 or 1 randomly. This result indicates that these algorithms (BPD, SBPI and rBP) have hardly any discrepancy in what values that early-BPD-fixed weights should take, while the values of late-BPD-fixed weights are prone to stochasticity during solving.
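For a single set of IO pairs, the per-weight disagreement with the BPD solution can be computed as in the sketch below. It assumes the solutions found under different random seeds are collected in a list of 0/1 tuples; the name `delta_bpd` is hypothetical, and the quenched average would be an additional mean over IO sets that is not shown here.

```python
def delta_bpd(solutions_x, w_bpd):
    """Per-weight mean squared difference between solver solutions and the BPD solution.

    solutions_x: list of 0/1 tuples found by SBPI or rBP under different seeds.
    w_bpd: 0/1 tuple found by BPD on the same set of IO pairs.
    """
    m = len(solutions_x)
    return [
        sum((x[i] - w_bpd[i]) ** 2 for x in solutions_x) / m
        for i in range(len(w_bpd))
    ]


# Hypothetical data: weight 0 always agrees with BPD; weights 1 and 2 agree half the time.
runs = [(1, 0, 1), (1, 1, 1), (1, 0, 0), (1, 1, 0)]
delta = delta_bpd(runs, (1, 0, 1))
```

A value of 0 means every run agrees with BPD on that weight, and a value of 0.5 is what results when the weight takes 0 or 1 with equal probability across runs.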

We tested the results above in large systems ( $N=10000$ ), and found results similar to those in Fig.3, where $N=480$ . Notably, the relative position $O_{BPD}/N$ (with $O_{BPD}$ being the BPD fixing order) at which $t_{i}^{\text{fix}}$ and $\Delta_{i,\text{BPD}}$ start to ramp up does not differ significantly between the $N=480$ and $N=10000$ cases (Supplementary Fig.5e-h). This implies that the portion of weights that are difficult to determine remains finite when $N\rightarrow\infty$ .

Together, during SBPI and rBP, the values of early-BPD-fixed weights are very easy to determine, and there is little disagreement about them among algorithms; most solving time steps are spent on determining the values of late-BPD-fixed weights after the early-BPD-fixed weights are fixed to their little-disputed values. Under sparser input, solutions can be found in fewer time steps mostly because fewer time steps are needed for determining the values of late-BPD-fixed weights. In other words, the difficult part of solving the binary-weight perceptron and also the advantage of input sparseness should originate from the subspace of weight configuration space in which early-BPD-fixed weights are fixed to their little-disputed values.

5 Approaching dense solution region through decimation

Previous studies suggest that solutions of binary-weight perceptron are typically isolated [11], but there is a spatial region in the weight configuration space where solutions densely aggregate [13, 20]. It is believed that solutions in the dense region have good generalization performance, and are those found by efficient algorithms [13, 15, 16, 17]. Similar scenario also exists in our model. Using the replica method introduced in Ref. [11, 13], we calculated the local entropy $F_{local}(D)$ (i.e., logarithm of solution number at distance $D=d/N$ from a weight configuration) from a typical solution or a configuration in the dense solution region (see Appendix E for details). Here, $d$ means Hamming distance between two weight configurations, which is the number of weights that take different values in the two weight configurations. Consistent with the findings in Ref. [11, 13, 15], we found that $F_{local}(D)$ from a typical solution is negative at small $D$ (Fig.4a), suggesting that solutions are typically isolated; however, from a configuration in the dense solution region, at small $D$ , $F_{local}(D)$ tends to its upper bound, which is the local entropy in the case that all weight configurations are solutions (Fig.4b), suggesting extremely dense solution aggregation. Using belief propagation (Appendix D), we found that the local entropy from a solution $\mathbf{w}_{0}$ found by SBPI or rBP is higher than that from a solution after long-term random walk (Appendix G) starting from $\mathbf{w}_{0}$ (Fig.4c,d), suggesting that solutions found by SBPI or rBP are close to the dense solution region.

Where is the dense solution region located? To answer this question, we calculated the difference of $w_{i}$ in a given solution $\mathbf{x}$ from that in the solution found by BPD: $\Delta_{i,\text{BPD}}(\mathbf{x})=(w_{i}^{\mathbf{x}}-w_{i}^{\text{BPD}})^{2}$ . We found that for solutions $\mathbf{w}_{0}$ found by SBPI or rBP, $\Delta_{i,\text{BPD}}(\mathbf{w}_{0})$ is close to zero for weights $i$ fixed early in BPD (Fig.4e,f). However, for solutions reached after a long-term random walk starting from $\mathbf{w}_{0}$ , $\Delta_{i,\text{BPD}}$ is significantly larger for these early-BPD-fixed weights (Fig.4e,f). This result, together with the calculation of local entropy (Fig.4c,d), suggests that solutions with small $\Delta_{i,\text{BPD}}$ in early-BPD-fixed weights have high local entropy. In other words, the dense solution region is located in the subspace where the early-BPD-fixed weights take their fixed values. These early-fixed weights tend to have strong polarization (i.e., small $\min\{p_{i}(1),1-p_{i}(1)\}$ , with $p_{i}(1)$ being the probability that $w_{i}=1$ in solution space, see Supplementary Fig.7e). Therefore, in a simple picture, solutions in the dense region are those whose strongly polarized weights take their preferred values.

For the convenience of the following discussion, we define solution subspace $\mathcal{S}(N_{\text{fix}})$ to be all the solutions in which the first fixed $N_{\text{fix}}$ weights during a weight-fixing process take their fixed values.

To better understand the dense-region approaching process during BPD, we studied small-sized systems in which all solutions can be exactly enumerated out. We calculated local entropy of the enumerated solutions in $\mathcal{S}(N_{\text{fix}})$ during BPD:

[TABLE]

where $\Theta(x)$ is Heaviside step function which is 1 when $x>0$ and 0 otherwise, $\mathcal{N}(\mathbf{w},d)$ is the number of solutions at Hamming distance $d$ from a configuration $\mathbf{w}$ , $\langle\cdot\rangle_{\mathbf{w}\in\mathcal{S}(N_{\text{fix}})}$ means average over the solutions in $\mathcal{S}(N_{\text{fix}})$ , and $[\cdot]$ represents quenched average (i.e., average over sets of IO pairs). We found that $F_{local}(d,N_{\text{fix}})$ at small $d$ increases with $N_{\text{fix}}$ during BPD (Fig.4g), and the solutions found by BPD after fixing all weights have higher-than-average local entropy at small $d$ (Fig.4h). These results further support BPD as a process approaching the dense solution region.
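For enumerated solutions, the quantities entering the local entropy can be computed directly. The sketch below is only an illustration for a single set of IO pairs: it assumes that the step function serves to exclude reference solutions with no neighbor at distance $d$ from the logarithm (contributing zero instead), and the names `hamming` and `local_entropy` are hypothetical.

```python
import math


def hamming(u, v):
    """Number of positions at which two weight configurations differ."""
    return sum(a != b for a, b in zip(u, v))


def local_entropy(subspace, all_solutions, d):
    """Average log-number of solutions at Hamming distance d from solutions in subspace.

    Assumption: a reference solution with no solution at distance d contributes 0,
    which is one way the Heaviside factor can guard the logarithm.
    """
    total = 0.0
    for w in subspace:
        count = sum(1 for s in all_solutions if hamming(w, s) == d)
        if count > 0:
            total += math.log(count)
    return total / len(subspace)


# Hypothetical toy solution set with two weights.
sols = [(0, 0), (0, 1), (1, 0)]
f_center = local_entropy([(0, 0)], sols, 1)  # two solutions at distance 1
f_edge = local_entropy([(0, 1)], sols, 1)    # one solution at distance 1
```

Restricting `subspace` to the solutions compatible with the first $N_{\text{fix}}$ fixed weights, and averaging the result over sets of IO pairs, gives an estimate in the spirit of $F_{local}(d,N_{\text{fix}})$ .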

As an intuitive explanation of how BPD relates to approaching the dense solution region, note that after we fix $w_{j}$ at its preferred value $a_{j}^{\text{prefer}}$ at time step $N_{\text{fix}}$ , we eliminate the solutions with $w_{j}\neq a_{j}^{\text{prefer}}$ from $\mathcal{S}(N_{\text{fix}}-1)$ . By fixing the most polarized weight at its preferred value, we reduce the number of unfixed weights $N_{\text{unfix}}$ while eliminating the smallest number of solutions. As a result, the solution density in the subspace of unfixed weights should continually increase during this process. Consistent with this scenario, we found that the solution entropy density in the subspace of unfixed weights $\frac{1}{N_{\text{unfix}}}\log(|\mathcal{S}(N_{\text{fix}})|)$ (with $|\mathcal{S}(N_{\text{fix}})|$ denoting the set size of $\mathcal{S}(N_{\text{fix}})$ ) continually increases during BPD (Fig.4i).
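The entropy density $\frac{1}{N_{\text{unfix}}}\log(|\mathcal{S}(N_{\text{fix}})|)$ can be traced along a given fixing order when the solutions are enumerated. The sketch below assumes a fixing order and fixed values are already available; the name `entropy_density_trace` and the toy data are hypothetical.

```python
import math


def entropy_density_trace(solutions, order, fixed):
    """Entropy density log|S(N_fix)| / N_unfix after each weight fixing.

    solutions: list of 0/1 tuples; order: weight indices in fixing order;
    fixed: dict mapping each index to its fixed value.
    The last step is skipped because no unfixed weight remains there.
    """
    n = len(solutions[0])
    remaining = list(solutions)
    trace = [math.log(len(remaining)) / n]  # N_fix = 0
    for n_fix, j in enumerate(order[:-1], start=1):
        # Keep only the solutions compatible with the weight fixed at this step.
        remaining = [w for w in remaining if w[j] == fixed[j]]
        trace.append(math.log(len(remaining)) / (n - n_fix))
    return trace


# Hypothetical toy solution set with an assumed fixing order and fixed values.
toy = [(1, 0, 1), (1, 1, 0), (1, 0, 0), (0, 0, 1)]
trace = entropy_density_trace(toy, order=[0, 1, 2], fixed={0: 1, 1: 0, 2: 1})
```

For this toy set the density grows at every step, because each fixing removes a smaller fraction of solutions than the fraction of unfixed weights it removes.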

We also tried exact decimation in small-sized systems, in which the marginal probabilities used for weight fixing are calculated from the enumerated solutions, and also found that local entropy and solution entropy in $\mathcal{S}(N_{\text{fix}})$ increase as weight fixing progresses (Supplementary Fig.6a-c). This means that the dense-region approaching phenomenon observed in BPD is universal in decimation algorithms, rather than an artifact introduced by belief propagation.

6 Cross-correlation of unfixed weights in solution subspace $\mathcal{S}(N_{\text{fix}})$

From the two subsections above, we see that the subspace of late-BPD-fixed weights contains the dense solution region, and determining the values of these late-BPD-fixed weights is the main difficulty in solving binary-weight perceptron. To better understand this difficulty as well as the advantage of input sparseness, we computed cross-correlation within the subspace of solutions $\mathcal{S}(N_{\text{fix}})$ during BPD:

[TABLE]

where $w_{i}$ and $w_{j}$ are two unfixed weights, and $\langle\cdot\rangle$ means average over solutions in $\mathcal{S}(N_{\text{fix}})$ .

Before decimation, $c_{ij}$ distributes narrowly around zero. With the progress of BPD, the distribution of $c_{ij}$ becomes broader (Fig.5a). Consistently, the mean square cross-correlations

[TABLE]

increases with BPD progress (Fig.5b,d, Supplementary Fig.7a,c). Cross-correlations are negative-dominated: their mean value

[TABLE]

is negative and decreases with BPD progress (Fig.5c,e, Supplementary Fig.7b,d). This indicates strong cross-correlation between late-BPD-fixed weights. This strong cross-correlation, which breaks the Bethe-Peierls approximation, is probably the reason for the difficulty in assigning values to late-BPD-fixed weights, and why SBPI and rBP spend so many time steps on assigning values to late-BPD-fixed weights (Fig.3b,c). At the late stage of BPD, with sparser input, $\overline{\text{XCorr}^{2}}$ is smaller and $\overline{\text{XCorr}}$ is less negative (Fig.5b-e, Supplementary Fig.7a-d). The reduction of cross-correlation between late-BPD-fixed weights may be the reason for quicker solving under sparser input.
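For enumerated solutions, pairwise statistics of this kind can be computed as in the sketch below. It assumes, purely for illustration, that $c_{ij}$ is the connected correlation $\langle w_{i}w_{j}\rangle-\langle w_{i}\rangle\langle w_{j}\rangle$ without normalization and that the overline denotes an average over pairs $i<j$ of unfixed weights; the name `cross_correlation_stats` is hypothetical.

```python
from itertools import combinations


def cross_correlation_stats(solutions, unfixed):
    """Mean square and mean of pairwise cross-correlations between unfixed weights.

    Assumption: c_ij = <w_i w_j> - <w_i><w_j>, averaged over the given solutions,
    and the summary statistics run over all pairs i < j of unfixed weights.
    """
    m = len(solutions)
    mean = {i: sum(w[i] for w in solutions) / m for i in unfixed}
    c = []
    for i, j in combinations(sorted(unfixed), 2):
        c_ij = sum(w[i] * w[j] for w in solutions) / m - mean[i] * mean[j]
        c.append(c_ij)
    mean_square = sum(x * x for x in c) / len(c)
    mean_value = sum(c) / len(c)
    return mean_square, mean_value


# Hypothetical subspace in which the two unfixed weights exclude each other.
anti = [(0, 1), (1, 0)]
xcorr2, xcorr = cross_correlation_stats(anti, unfixed=[0, 1])
```

In this toy subspace the two weights are perfectly anti-correlated, which yields a negative mean, whereas a subspace containing all four configurations of two weights gives zero cross-correlation.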

Notably, at the early stage of BPD, both $\overline{\text{XCorr}^{2}}$ and $|\overline{\text{XCorr}}|$ are small, and they do not change much with input sparsities (Fig.5b-e, Supplementary Fig.7a-d). This implies that the difficulty in solving binary-weight perceptron and also the computational advantage of input sparseness come from the structure of solution space near the dense solution region, and cannot be unveiled using equilibrium analysis where all solutions have equal statistical contribution.

We also investigated $\overline{\text{XCorr}^{2}}$ and $\overline{\text{XCorr}}$ during exact decimation in small-sized systems, and found similar profile how $\overline{\text{XCorr}^{2}}$ and $\overline{\text{XCorr}}$ change with $N_{\text{fix}}$ and $f_{in}$ (Supplementary Fig.6d,e).

7 Understanding the weight cross-correlation in $\mathcal{S}(N_{\text{fix}})$ through geometry of solution clusters

In this subsection, we will try to understand the cross-correlation between unfixed weights in $\mathcal{S}(N_{\text{fix}})$ from the geometry of solution clusters.

A solution cluster in $\mathcal{S}(N_{\text{fix}})$ is defined as a set of solutions that can be connected by flipping a single unfixed weight. A cluster is used to represent a pure state, which is a set of weight configurations that are separated from other configurations by infinite free-energy barrier. In other words, from a configuration in a pure state, the system needs infinitely long time to access a configuration outside of this pure state under natural thermodynamics. Strictly speaking, it is not known how to adapt the definition of pure states to instances of finite size. Here we adopt the suggestion in Ref. [38, 39], and numerically identify a pure state in $\mathcal{S}(N_{\text{fix}})$ as a solution cluster.
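The numerical identification of clusters amounts to finding connected components of a graph whose nodes are solutions and whose edges join solutions that differ in exactly one unfixed weight. A minimal sketch, assuming enumerated solutions stored as 0/1 tuples (the name `single_flip_clusters` is hypothetical):

```python
def single_flip_clusters(solutions, unfixed):
    """Group solutions into clusters connected by flipping a single unfixed weight.

    solutions: list of 0/1 tuples belonging to the subspace of interest.
    unfixed: indices of the weights that are allowed to flip.
    Returns a list of clusters, each a list of solutions.
    """
    solution_set = set(solutions)
    seen = set()
    clusters = []
    for start in solutions:
        if start in seen:
            continue
        seen.add(start)
        stack, cluster = [start], []
        while stack:  # depth-first search over single-flip neighbors
            w = stack.pop()
            cluster.append(w)
            for i in unfixed:
                neighbor = w[:i] + (1 - w[i],) + w[i + 1:]
                if neighbor in solution_set and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(cluster)
    return clusters


# Hypothetical toy solution set: two solutions are neighbors, one is isolated.
toy = [(0, 0, 0), (0, 0, 1), (1, 1, 0)]
clusters = single_flip_clusters(toy, unfixed=[0, 1, 2])
sizes = sorted(len(c) for c in clusters)
```

Each solution generates one candidate neighbor per unfixed weight, so the cost is linear in the number of solutions times $N_{\text{unfix}}$ , which is affordable only for the small enumerable systems considered here.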

Under the 1-step replica symmetry breaking (1RSB) ansatz, the distribution of the overlap between solutions in the same cluster (or in different clusters) is a delta function located at $q_{\text{same}}$ (or $q_{\text{diff}}$ ). In this case, the mean square cross-correlation can be related to the solution overlap in the limit $N_{\text{unfix}}\rightarrow\infty$ as (Supplemental Material Section 5)

[TABLE]

where $x_{*}$ is Parisi parameter, which is the probability that two solutions belong to different clusters. Numerically, $x_{*}$ is defined as

[TABLE]

where $N_{\gamma}$ is the number of solutions in pure state $\gamma$ , and $N_{solution}$ is the total number of solutions. Using the replica method introduced in Ref. [39], it can be shown that in the full solution space $\mathcal{S}(N_{\text{fix}}=0)$ of infinite-sized systems, $x_{*}=1$ and $q_{\text{same}}=q_{\text{diff}}$ (see Appendix F for discussion and Appendix Fig.F1 for numeric support), which from eq.9 implies zero cross-correlation. Possibly because of this zero cross-correlation, belief propagation can give entropy landscape closely matching that predicted by replica method in the solution space before decimation [40].
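One estimator consistent with the verbal definition of $x_{*}$ (the probability that two solutions belong to different clusters) is sketched below. It assumes the two solutions are drawn uniformly and independently, with replacement, from all solutions; this convention and the name `parisi_parameter` are illustrative assumptions.

```python
def parisi_parameter(cluster_sizes):
    """Probability that two uniformly drawn solutions lie in different clusters.

    cluster_sizes: numbers of solutions N_gamma in each cluster.
    Assumption: the two solutions are drawn independently with replacement,
    so the same-cluster probability is the sum of (N_gamma / N_solution)^2.
    """
    total = sum(cluster_sizes)
    same = sum((n / total) ** 2 for n in cluster_sizes)
    return 1.0 - same


x_single = parisi_parameter([5])          # all solutions in one cluster
x_split = parisi_parameter([2, 2])        # two equal clusters
x_isolated = parisi_parameter([1, 1, 1, 1])  # four isolated solutions
```

Under this convention the estimate is 0 when a single cluster holds every solution and tends to 1 only when no cluster holds a finite fraction of the solutions.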

In the following context, we will try to understand the cross-correlation between unfixed weights through eq.9, by investigating 1-weight-flip clusters in $\mathcal{S}(N_{\text{fix}})$ using enumerated solutions of small-sized systems. Strictly speaking, eq.9 is valid only when $N_{\text{unfix}}\rightarrow\infty$ . Here, by investigating small-sized systems, we hope to get some understanding on the change of cross-correlation during decimation observed in finite-sized systems (Fig.5), and also the possible phase transition ( $\overline{\text{XCorr}^{2}}$ transits from zero to nonzero) during decimation in infinite-sized systems.

We found that $x_{*}$ decreases with $N_{\text{fix}}$ during decimation (Fig.7a,e), which implies a solution condensation process. One possible scenario is as follows. Before decimation, the solution space contains sub-exponential number of large clusters with exponentially many solutions in each, and exponentially many small clusters with sub-exponential number of solutions in each [39] (Appendix F). The dense solution region presumably lies in a region where most solutions belong to large clusters. Therefore, in the solution subspace $\mathcal{S}(N_{\text{fix}})$ at the late stage of decimation, solutions may be more condensed in large clusters (Fig.6). With this solution condensation, $x_{*}$ decreases from 1, because two randomly chosen solutions become more likely to locate in the same large pure state. The decrease of $x_{*}$ from 1 increases weight cross-correlation through eq.9.

We tested a hypothesis in the discussion above: most solutions around the dense solution region come from large clusters. To this end, we calculated a $k$ index, defined as

[TABLE]

where $N_{\gamma\subset\mathcal{S}(N_{\text{fix}}=0)}(\mathbf{w})$ means the number of solutions in cluster $\gamma$ that contains solution $\mathbf{w}$ , and the subscript $\gamma\subset\mathcal{S}(N_{\text{fix}}=0)$ means that $\gamma$ is a cluster defined in the full solution space $\mathcal{S}(N_{\text{fix}}=0)$ before weight fixing. Therefore, $k(N_{\text{fix}})$ means the average entropy of the clusters in $\mathcal{S}(N_{\text{fix}}=0)$ that solutions in $\mathcal{S}(N_{\text{fix}})$ belong to. We found that $k(N_{\text{fix}})$ increases with $N_{\text{fix}}$ (Fig.7b,f); and that $k(N_{\text{fix}}=N)$ , which is the entropy of the cluster that contains the solution obtained after fixing all weights, is close to the entropy of the largest cluster in $\mathcal{S}(N_{\text{fix}}=0)$ (Fig.7b,f). These results support the scenario depicted in Fig.6: solutions in the neighborhood of the dense solution region mostly belong to large clusters.
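Given the clusters of the full solution space, the average cluster entropy seen by the solutions of a subspace can be computed as below. The sketch assumes the entropy of a cluster is the natural logarithm of its size with no further normalization, and it covers a single set of IO pairs; the name `k_index` is hypothetical.

```python
import math


def k_index(subspace, full_clusters):
    """Average log-size of the full-space clusters containing the subspace solutions.

    subspace: solutions (0/1 tuples) in S(N_fix).
    full_clusters: clusters of the full solution space, each a list of solutions.
    Assumption: cluster entropy is log(cluster size), without normalization.
    """
    size_of = {w: len(cluster) for cluster in full_clusters for w in cluster}
    return sum(math.log(size_of[w]) for w in subspace) / len(subspace)


# Hypothetical full-space clusters: one of size 2 and one isolated solution.
full = [[(0, 0, 0), (0, 0, 1)], [(1, 1, 0)]]
k_all = k_index([(0, 0, 0), (0, 0, 1), (1, 1, 0)], full)
k_large = k_index([(0, 0, 0)], full)
```

In this toy example, restricting the subspace to a solution of the larger cluster raises the index, which is the kind of increase with $N_{\text{fix}}$ that the hypothesis predicts.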

We found that $x_{*}<1$ before decimation (i.e., when $N_{\text{fix}}=0$ ), see Fig.7a,e. This is probably a finite-size effect: the replica method in Ref.[39] (Appendix F) predicts that $x_{*}=1$ before decimation; and numerically, we found that $x_{*}$ before decimation approaches 1 as $N$ gets large (Appendix Fig.F1).

We found that $q_{\text{same}}-q_{\text{diff}}$ tends to increase with decimation progress, and decrease with input sparseness (Fig.7c,g). This means that around the dense solution region, the difference of overlaps within and between clusters is larger than that in full solution space, and gets smaller with sparser input. This point contributes to the increase of weight cross-correlation during decimation, and also the reduction of weight cross-correlation under sparse input at the late stage of decimation through eq.9.

We compared $\overline{\text{XCorr}^{2}}$ calculated from eq.9 with that directly calculated by eq.7. We found that eq.9 cannot reproduce the value of $\overline{\text{XCorr}^{2}}$ , but it can manifest some features of the profile how $\overline{\text{XCorr}^{2}}$ changes with $N_{\text{fix}}$ and $f_{in}$ (Fig.7d,h): (1) $\overline{\text{XCorr}^{2}}$ increases with $f_{in}$ ; (2) $\overline{\text{XCorr}^{2}}$ increases with $N_{\text{fix}}$ , especially for BPD when $f_{in}$ is large and for exact decimation. This suggests that eq.9 provides a promising aspect to understand the strong weight cross-correlation at the late stage of decimation, and the reduction of weight cross-correlation under sparse input.

Possible reasons why eq.9 cannot reproduce the value of $\overline{\text{XCorr}^{2}}$ include: (1) finite size of $N_{\text{unfix}}$ , (2) failure of the 1RSB ansatz, (3) improper numeric definition of pure state. The first and second points undermine the conditions under which eq.9 holds. As for the third point, as mentioned in Ref. [38], a shortcoming of defining a pure state using 1-weight flips is that it cannot describe entropic barriers. For example, suppose there are two subsets of solutions. Both subsets are densely intra-connected by 1-weight flips, but there is only a single long path of 1-weight flips connecting the two subsets. According to the 1-weight-flip definition of pure state, these two subsets should belong to the same pure state. However, it is very hard to reach one subset starting from the other through this single path using random-walk dynamics when $N$ is large. So according to the physical definition of pure state mentioned at the beginning of this subsection, these two subsets should belong to different pure states.

8 The case when $\alpha\rightarrow\alpha_{c}^{\text{theo}}$

In previous sections, we studied the cases when $\alpha$ is not too close to the theoretical capacity $\alpha_{c}^{\text{theo}}$ , so that SBPI and rBP can efficiently solve the perceptron problem. In this section, we will study the case when $\alpha\rightarrow\alpha_{c}^{\text{theo}}$ , so that SBPI and rBP are no longer able to solve the problem in reasonable time. In this case, one has to wait for a number of time steps exponential in $N$ to find solutions that exist in theory. Here, we will try to understand the origin of this exponential-time difficulty by investigating the dynamics of SBPI and rBP when $\alpha\rightarrow\alpha_{c}^{\text{theo}}$ , keeping $f_{in}=0.5$ .

As the first step, we studied the reshaping of dense solution region with $\alpha$ by calculating large-deviation local entropy $F_{local}(D)$ , which quantifies the number of solutions at distance $D$ from a configuration in the dense solution region. Similarly as in previous studies [15, 20], we found a transition point $\alpha_{U}$ of $F_{local}(D)$ (Fig.8a): when $\alpha<\alpha_{U}$ , saddle point equations always have solutions, so that we can calculate $F_{local}(D)$ at $D$ close to zero; when $\alpha>\alpha_{U}$ , however, saddle point equations stop having solutions at some value of $D>0$ . As suggested in Ref.[15, 20], $F_{local}(D)$ may be no longer monotonic when $\alpha>\alpha_{U}$ , which implies that the dense solution region fragments into separate regions. After noting that no known solvers have algorithmic capacity larger than $\alpha_{U}$ , Ref.[20] suggests that $\alpha_{U}$ signals a transition between easy and hard solving phase.

We have shown that SBPI and rBP spend most time steps on determining late-decimation-fixed weights (Fig.3), and with weight fixing, the subspace of unfixed weights gets shrinking to the dense solution region (Fig.4). Therefore, if the structural transition of the dense region at $\alpha_{U}$ is indeed responsible for the difficult solving when $\alpha>\alpha_{U}$ , there should be some observable effect on late-decimation-fixed weights around $\alpha_{U}$ .

To study this possible effect, we investigated the dynamics of SBPI and rBP at a few $\alpha$ values (Fig.8). With the increase of $\alpha$ , the probabilities that rBP and SBPI can solve the problem decrease (Fig.8b). No matter whether SBPI or rBP can solve successfully, we defined $t_{i}^{\text{fix}}$ as the last time step that the hidden state $h_{i}$ of weight $w_{i}$ changed its sign, and $\Delta_{i,\text{BPD}}$ as the difference of the value of $w_{i}$ that SBPI or rBP ended up from the value that BPD ended up. We considered the following two cases. Case I: when $\alpha$ took small values so that SBPI (or rBP) succeeded to solve with high probability, we investigated $t_{i}^{\text{fix}}$ and $\Delta_{i,\text{BPD}}$ in the trials when solving succeeded. Case II: when $\alpha$ took large values so that SBPI (or rBP) failed to solve with high probability, we investigated $t_{i}^{\text{fix}}$ and $\Delta_{i,\text{BPD}}$ in the trials when solving failed. We studied the functions $t_{i}^{\text{fix}}(O_{\text{BPD}})$ and $\Delta_{i,\text{BPD}}(O_{\text{BPD}})$ , where $O_{\text{BPD}}$ is BPD fixing order, with particular interest in the relative position $O_{\text{BPD}}/N$ where these two functions start to ramp. $O_{\text{BPD}}/N$ estimates the portion of weights whose values can be easily determined during algorithmic solving.

Interestingly, we found that in Case I, the functions $\Delta_{i,\text{BPD}}(O_{\text{BPD}})$ at different $\alpha$ values almost overlap (Fig.8e,h, lower panels); and after normalizing the maximum of $t_{i}^{\text{fix}}$ to 1, the functions $t_{i}^{\text{fix}}(O_{\text{BPD}})$ at different $\alpha$ values also almost overlap (Fig.8e,h, upper panels). This suggests that the portion of weights whose values are easy to determine does not change much with $\alpha$ at small values of $\alpha$ , where the perceptron problem can be easily solved. In Case II, we observed a sharp decrease of the start-ramping position around $\alpha_{U}$ (Fig.8c,f); and this decrease gets sharper with larger $N$ (Fig.8d,g). Note that we only plotted the start-ramping positions of $\Delta_{i,\text{BPD}}$ against $\alpha$ in Fig.8d,g, because $\Delta_{i,\text{BPD}}$ has more clear-cut start-ramping positions. The start-ramping positions of $t_{i}^{\text{fix}}$ are largely comparable to those of $\Delta_{i,\text{BPD}}$ (Fig.8c,f). Collectively, it seems that the portion of weights whose values are difficult to determine does not change much with $\alpha$ when $\alpha<\alpha_{U}$ , but increases sharply around $\alpha_{U}$ (Fig.8i). This sharp increase in the portion of difficult weights may be responsible for the difficult solving when $\alpha>\alpha_{U}$ .

9 The geometry of solution clusters when $\alpha\rightarrow\alpha_{c}^{\text{theo}}$

To get more understanding on the exponential-time difficulty encountered by algorithms when $\alpha\rightarrow\alpha_{c}^{\text{theo}}$ , we performed exact decimation in small-sized systems, and studied the geometry of solution clusters connected by single weight flipping.

As the first step, similarly as in Fig.7e-h, we studied cross-correlation in the solution subspace $\mathcal{S}(N_{\text{fix}})$ after fixing $N_{\text{fix}}$ weights during exact decimation. We found that mean square cross-correlation $\overline{\text{XCorr}^{2}}$ increases with both $N_{\text{fix}}$ and $\alpha$ (Fig.9a). We found that Parisi parameter $x_{*}$ decreases with $N_{\text{fix}}$ (Fig.9b), suggesting a solution condensation process during decimation, and that $q_{\text{same}}-q_{\text{diff}}$ increases with $N_{\text{fix}}$ and $\alpha$ (Fig.9c). The $\overline{\text{XCorr}^{2}}$ value calculated by eq.9 can reproduce the profile how $\overline{\text{XCorr}^{2}}$ changes with $N_{\text{fix}}$ and $\alpha$ (Fig.9d). Notably, $x_{*}$ increases with $\alpha$ (Fig.9b), which, when $x_{*}$ is close to 1, reduces $\overline{\text{XCorr}^{2}}$ from eq.9. Therefore, the observed increase of $\overline{\text{XCorr}^{2}}$ with $\alpha$ (Fig.9d) comes from the increase of $q_{\text{same}}-q_{\text{diff}}$ . This increase of $\overline{\text{XCorr}^{2}}$ with $\alpha$ should at least contribute to the exponential-time difficulty when $\alpha\rightarrow\alpha_{c}^{\text{theo}}$ .

We then visualized the geometry of solution clusters using multi-dimensional scaling (MDS). MDS maps high-dimensional data points into 2-dimensional space while preserving inter-point distances to the most extent. Using MDS, we hope to get some intuitive understanding on the spatial reorganization of solution clusters with decimation and $\alpha$ .

In Fig.9e, we illustrate the solution clusters during the decimation processes of three perceptron instances at three $\alpha$ values. Each of these instances was chosen so that its solution number took the median value in a number of random instances at the corresponding $\alpha$ . In this way, the illustrated instances are hopefully representative to manifest the reorganization of solution clusters with decimation and $\alpha$ .

When $\alpha$ is small, in the full solution space before decimation (upper left corner of Fig.9e), a large cluster dominates. It is hard to spatially separate this large cluster from the other small clusters by drawing boundaries between them, which suggests that the large cluster is spatially intertwined with the small clusters. Using the replica method introduced in Ref.[39] (see Appendix F), we know that $q_{\text{same}}=q_{\text{diff}}$ before decimation, and that $q_{\text{same}}$ is contributed by overlaps between solutions in the same large cluster. Here we show that $q_{\text{same}}=q_{\text{diff}}$ possibly arises because the large cluster is highly spatially dispersed, filling the same spatial region together with the small clusters, so that two solutions in the large cluster or in different clusters have almost the same overlap. With decimation, the number of clusters is reduced, consistent with the solution condensation scenario. At the late stage of decimation, different clusters become easier to separate spatially (right column of Fig.9e), consistent with the observation that $q_{\text{same}}-q_{\text{diff}}$ increases with decimation (Fig.9c).

With the increase of $\alpha$ , solution clusters get easier to separate even in the full solution space before decimation (lower left corner of Fig.9e). However, if we accept that the result $q_{\text{same}}=q_{\text{diff}}$ is valid at all $\alpha$ values, large clusters should also disperse even when $\alpha\rightarrow\alpha_{c}$ , so the observed easy separation should be a finite-size effect. A conjecture that is consistent with the current knowledge is that when $\alpha>\alpha_{U}$ , spatial separation of clusters happens at very early stage of decimation, due to the fragmentation of the dense solution region. This spatial separation induces high $q_{\text{same}}-q_{\text{diff}}$ , resulting in high cross-correlation through eq.9 after decimating only a few weights.

10 Conclusion and Discussion

In this paper, we tried to understand the computational difficulty of binary-weight perceptron and the advantage of input sparseness in classification task. The novelty of our work is we regard decimation process as reference to study the dynamics of two efficient solvers (SBPI and rBP) and use decimation process to study the reshaping of the structure of solution space during approaching the dense solution region. We studied two types of algorithmic difficulty: (1) difficult part during efficient solving, with particular interest in the advantage of input sparseness and (2) exponential-time difficulty, which happens when $\alpha\rightarrow\alpha_{c}^{\text{theo}}$ . We found that in the case of successful solving, most time steps in SBPI and rBP are spent on determining the values of the weights late fixed in decimation, which is the difficult solving part. Under biologically plausible requirements for low energy consumption and high robustness of memory retrieval, perceptron can be solved in fewer time steps with sparser input, because the values of late-decimation-fixed weights become easier to determine. We then tried to understand the difficulty in assigning late-decimation-fixed weights and the advantage of input sparseness from the aspect of cross-correlation of unfixed weights during decimation. We found that this cross-correlation increases with decimation progress and decreases with input sparseness. We then studied the geometry structure of clusters in the subspace of unfixed weights during decimation. We proposed that the change of cross-correlation with decimation progress and input sparsity may be related to solution condensation in the subspace of unfixed weights during decimation, and also the fact that the difference $q_{\text{same}}-q_{\text{diff}}$ between overlaps of solutions in the same and different clusters increases with decimation progress and decreases with input sparsity. 
To understand exponential-time difficulty, we compared the dynamics of SBPI and rBP with the weight-fixing order in BPD when $\alpha\rightarrow\alpha_{c}^{\text{theo}}$ , and found that the portion of weights whose values are difficult to determine increases sharply around $\alpha_{U}$ . By studying small-sized systems using exact enumeration, we suggest that this large portion of difficult weights arises because spatial separation of solution clusters happens at a very early stage of decimation, due to the fragmentation of the dense solution region around $\alpha_{U}$ .

The binary-weight perceptron belongs to the class of constraint satisfaction problems (CSPs). At present, we still do not have a full understanding of the computational difficulty of CSPs. In this paper, we suggest that solution condensation during decimation is an important aspect for understanding the difficulty in solving CSPs. This viewpoint is supported by Ref.[41], where a uniform BP decimation process was studied in XORSAT and SAT problems. In uniform BP decimation, the variable to be fixed at each time step is randomly chosen, unlike in the BPD studied in this paper, in which the to-be-fixed variable in each time step is the most biased variable. Uniform BP decimation, in the ideal case, will end up with a solution chosen uniformly from the whole solution set, so it is unsuitable for studying the process approaching the dense solution region. However, statistical properties during uniform BP decimation can be computed using the cavity method, and they are helpful for understanding the computational difficulty during decimation. It was found that the $\alpha$ value at which uniform BP decimation fails to find a solution coincides well with the $\alpha$ value at which condensation occurs during decimation [41]. In the context of the binary perceptron, another observation that supports the importance of solution condensation for understanding the exponential-time difficulty when $\alpha>\alpha_{U}$ is the close-to-zero external entropy in the large-deviation calculation of the dense solution region [15]. External entropy measures the logarithm of the number of configurations that can be defined as the center of the dense region. Therefore, zero external entropy means that only a sub-exponential number of configurations can be regarded as the center of the dense region. Consequently, when $\alpha>\alpha_{U}$ , the dense region should fragment into a sub-exponential number of pieces, each of which surrounds a center. 
In this case, the scenario of solution condensation into sub-exponential clusters naturally occurs in the subspace around the dense region.

It is tempting to understand exponential-time difficulty and the difficult part during efficient solving from a unified viewpoint. From eq.9, condensation (i.e., $x_{*}<1$ ) implies finite cross-correlation in the large $N$ limit. If finite cross-correlation implies exponential solving time, it is reasonable to hypothesize that strong but $o(1)$ cross-correlation underlies the difficult part of solving, in which solving takes a sub-exponential but large number of time steps. In Supplementary Fig.8, we studied the scaling of mean square cross-correlation ( $\overline{\text{XCorr}^{2}}$ ) with $N$ during decimation at a low $\alpha$ value at which BPD can find a solution with high probability. We found that although $\overline{\text{XCorr}^{2}}$ gets high at the late stage of decimation, it tends to zero when $N$ gets large (Supplementary Fig.8a-c). In Fig.7, we observed that $x_{*}<1$ and $q_{\text{same}}-q_{\text{diff}}>0$ during decimation. If $\overline{\text{XCorr}^{2}}\rightarrow 0$ when $N\rightarrow\infty$ , then at least one of these two observations should be a finite-size effect. Our numerical results support both $x_{*}\rightarrow 1$ and $q_{\text{same}}-q_{\text{diff}}\rightarrow 0$ with $N\rightarrow\infty$ (Supplementary Fig.8d-i). The speeds at which $x_{*}$ and $q_{\text{same}}-q_{\text{diff}}$ scale with $N$ control the number of time steps needed to find solutions when solving is efficient.

One influential viewpoint is that exponential-time difficulty comes from frozen variables in clusters [42, 12, 43], which are variables that take the same value in all solutions of the same cluster. However, when solving CSPs in which all solutions are isolated, or in other words, all variables are frozen in every cluster, BPD can still evaluate the preferences of variables at the early stage, while hardness emerges as BPD progresses [12]. This implies that the freezing in the full solution space does not hamper variable assignment at the beginning of BPD, while difficulty comes from the subspace of unfixed variables during BPD. In other words, freezing may not be the origin of computational difficulty, although it leads to this difficulty. From the solution condensation scenario depicted in this paper, we presume that the difficulty during BPD in Ref. [12] may be related to a solution elimination process: the number of isolated solutions in the subspace of unfixed variables decreases during BPD, and difficulty arises when this number becomes finite so that $x_{*}<1$ , indicating that solutions are condensed into a small number of clusters; the isolation of solutions in Ref. [12] results in large $q_{\text{same}}-q_{\text{diff}}$ , which produces large cross-correlation through eq.9. In other words, the number of un-eliminated solutions decreases with decimation, until, if solving is successful, only one solution is obtained after fixing all variables. So during decimation, there should be some steps at which the solution number is finite. Because the solutions in the model of Ref.[12] are isolated, the number of clusters at these steps is also finite, resulting in finite cross-correlation. Therefore, it is hard to tell whether the solving failure observed in Ref.[12] results from freezing or from condensation during decimation.

The idea that solution condensation causes exponential-time difficulty was also evaluated in Ref.[42]. It was found that algorithms can efficiently find solutions even when $\alpha$ is larger than the condensation transition point $\alpha_{c}$ , where the solution space is dominated by a sub-exponential number of large pure states. This observation argues against the relevance of condensation to computational difficulty. However, in our view, for a fair comparison between the condensation point and the point of algorithmic failure, one should use the same sampling measure. Specifically, an algorithm may be regarded as a configuration sampler. On the one hand, during the ongoing solving process, it continually samples configurations with positive energy. On the other hand, the algorithm stops once it encounters a solution (i.e., a zero-energy configuration), so a solution is sampled at each successful solving trial. However, the probability measure with which a configuration is sampled by the algorithm may not coincide with the Boltzmann measure widely used in statistical calculation. So it is no surprise that the condensation transition point $\alpha_{c}$ based on the Boltzmann measure cannot describe the algorithmic failure well. In other words, the energetic landscape calculated with the Boltzmann measure may be different from the energetic landscape encountered by the algorithm. A good example of this discrepancy is the binary-weight perceptron studied in this paper, in which typical solutions are isolated [11], but algorithms tend to find solutions in the dense region [13]. In the model of Ref.[42], one possible scenario is that when $\alpha>\alpha_{c}$ , algorithms tend to sample configurations near the largest pure state, so that this single pure state dominates the subspace where algorithms usually sample, which makes algorithmic solving simple. 
Moreover, as we mentioned previously, uniform BP decimation algorithm samples all solutions uniformly; and the failure of this algorithm is well predicted by condensation during decimation calculated using Boltzmann measure [41].

The dense solution region is called “sub-dominant” in Ref. [13], because solutions in the dense region are so rare that they cannot be unveiled using equilibrium analysis. We found that solutions in the dense region are those whose strongly polarized weights take their preferred values, which may provide a way to understand the scarcity of solutions in the dense region: in solutions of the dense region, all strongly polarized weights take their preferred values, while in typical solutions, some strongly polarized weights violate their preferences (manifested by Fig.4e,f); typical solutions far outnumber solutions in the dense region because there are combinatorially many ways to choose the weights that violate their preferences. As a simple example, suppose a perceptron with 5 weights has six solutions: 00000, 10000, 01000, 00100, 00010 and 00001. Here, each digit indicates the value of a weight (e.g., 10000 means that the value of the first weight is 1, while the other four weights are zero). In this case, each weight has probability $5/6\approx 0.833$ of being 0, so according to our finding (i.e., the dense solution region is made up of solutions in which strongly polarized weights take their preferred values), the dense solution region should be the solution in which all these weights take 0 (i.e., 00000). Consistently, it is easy to check that 00000 has 5 neighbors at Hamming distance 1, while all the other solutions have only one neighbor at Hamming distance 1, so 00000 has larger local entropy. However, we see that there is only one solution, 00000, in which all weights take their preferred value 0, while in each of the other five solutions, there is a weight that violates its preference and takes 1.
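The five-weight example above can be checked by direct enumeration. The following is a minimal sketch; the variable names (`solutions`, `p_zero`, `neighbours`, `preferred`) are illustrative and do not come from the paper.

```python
# Toy example from the text: a 5-weight perceptron with six solutions.
solutions = ["00000", "10000", "01000", "00100", "00010", "00001"]
n_weights = len(solutions[0])


def hamming(a, b):
    """Number of positions at which two configurations differ."""
    return sum(x != y for x, y in zip(a, b))


# Marginal probability that each weight equals 0 over the solution set.
p_zero = [
    sum(s[i] == "0" for s in solutions) / len(solutions)
    for i in range(n_weights)
]

# For each solution, count the other solutions at Hamming distance 1.
neighbours = {
    s: sum(hamming(s, t) == 1 for t in solutions) for s in solutions
}

# Configuration in which every weight takes its preferred (majority) value.
preferred = "".join("0" if p > 0.5 else "1" for p in p_zero)

print(p_zero)      # each entry equals 5/6
print(neighbours)  # 00000 has 5 neighbours, every other solution has 1
print(preferred)   # 00000
```

The configuration built from the preferred values coincides with the solution that has the most neighbouring solutions, which is the sense in which the text identifies it with the dense region.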

Our result suggests that when training neural networks, the learning dynamics of different weights are heterogeneous: the values of some weights can be quickly determined in the first few time steps, while most time steps are spent on determining the values of some other weights (Fig.3b,c). This result has implications for both neuroscience and machine learning. During embryonic and infant development, synapses are eliminated in large numbers [44]. One may wonder why the brain removes synapses that could be used to store information, long before the brain becomes functionally competent. Our result suggests that some synapses may be assigned zero efficacy at the very early learning stage of the animal, and are unlikely to be changed in the subsequent learning. In this case, it is structurally and spatially economical to remove these silent synapses as soon as possible. Additionally, we presume that this heterogeneous dynamics of weights may also exist when training artificial neural networks. We can evaluate the confidence that a weight stays at a value using hidden states as in SBPI (also see Ref. [6, 7]), and fix the high-confidence weights at their values in a few initial time steps, excluding them from further updating. This scheme can significantly reduce the computational overhead during training. Notably, in both biological and artificial contexts, this early-weight-fixing strategy may not compromise the generalization performance of the neural network after training, because the dense solution region lies in the subspace where these early-fixed weights take the values they are fixed at.

Acknowledgments

We thank Haiping Huang for constructive comments. This work was partially supported by Hong Kong Baptist University Strategic Development Fund, RGC (Grant No. 12200217) and NSFC (Grants No.11275027).

Appendix A Implementations of SBPI and rBP

In Fig.2, the capacity of an algorithm was estimated by adding input-output associations one by one, until the algorithm failed to find a solution within a given number of time steps. Below we briefly introduce the two efficient algorithms (SBPI and rBP) we studied in this paper.

SBPI

We used the SBPI01 algorithm introduced in Ref. [18]. In this algorithm, each synapse has a hidden variable $h_{i}$ that takes odd integer values between $-K$ and $K$ . $w_{i}=1$ if $h_{i}>0$ , and $w_{i}=0$ if $h_{i}<0$ . When an input-output (IO) pair $\mu$ is presented, the hidden states are updated in the following two cases. Firstly, when this IO pair is unassociated, $h_{i}$ is updated as $h_{i}\leftarrow h_{i}+2\xi_{i}^{\mu}(2\sigma^{\mu}-1)$ . Secondly, when this IO pair is associated, but $\sigma^{\mu}=0$ and $0<(2\sigma^{\mu}-1)(\mathbf{w}\cdot\mathbf{\xi}^{\mu}-N/A)\leq 1$ (i.e., the pair is only barely associated), then with probability $p_{s}$ , $h_{i}\leftarrow h_{i}-2\xi_{i}^{\mu}$ . Solving was stopped when a solution was found or after sweeping all IO pairs in random sequential order $T_{max}=4000$ times. We used $p_{s}=0.3$ and $K=[21\sqrt{\frac{N}{480}}]$ , with $[x]$ being the integer nearest to $x$ .
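The update rule above can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the function name `sbpi01`, the random initialization of hidden states, the clipping of $h_{i}$ to the allowed range, and the convention that the output is 1 only when the total input strictly exceeds $N/A$ are all assumptions. The barely-associated condition is read as the total input lying at most 1 below threshold when $\sigma^{\mu}=0$ .

```python
import numpy as np


def sbpi01(xi, sigma, A, K, p_s=0.3, t_max=4000, seed=0):
    """Sketch of the SBPI01 rule. Returns (w, h, sweeps) on success,
    or (None, h, t_max) if no solution is found within t_max sweeps."""
    rng = np.random.default_rng(seed)
    P, N = xi.shape
    theta = N / A
    k_max = K if K % 2 == 1 else K - 1  # largest odd value not exceeding K
    # Assumption: hidden states start at random odd integers in [-k_max, k_max].
    h = 2 * rng.integers(0, k_max + 1, size=N) - k_max
    for sweep in range(1, t_max + 1):
        # Present all IO pairs once, in a freshly shuffled order.
        for mu in rng.permutation(P):
            w = (h > 0).astype(int)
            field = w @ xi[mu] - theta
            out = 1 if field > 0 else 0  # assumed threshold convention
            if out != sigma[mu]:
                # Case 1: pair is unassociated.
                h = h + 2 * xi[mu] * (2 * sigma[mu] - 1)
            elif sigma[mu] == 0 and 0 < -field <= 1 and rng.random() < p_s:
                # Case 2: associated with output 0, but only barely.
                h = h - 2 * xi[mu]
            # Assumption: hidden states saturate at the boundaries.
            h = np.clip(h, -k_max, k_max)
        w = (h > 0).astype(int)
        if np.array_equal((xi @ w - theta > 0).astype(int), sigma):
            return w, h, sweep
    return None, h, t_max
```

Because every update changes $h_{i}$ by $\pm 2$ and the bounds are odd, the hidden states remain odd integers, so $h_{i}=0$ never occurs and $w_{i}$ is always well defined.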

rBP

The rBP updating rule [19] is

[TABLE]

with

[TABLE]

The values of weights $\mathbf{w}$ are determined by the sign of single-site fields

[TABLE]

so that $w_{i}=1$ if $h_{i}>0$ and $w_{i}=0$ otherwise. The message-passing equations eq.12 are iterated on a factor graph in which the $i$ th variable node (representing $w_{i}$ ) and the $\mu$ th factor node (representing the $\mu$ th IO pair) are linked if $\xi_{i}^{\mu}=1$ . In eqs.12 and 15, $p(t)$ is a random number that takes $1$ with probability $1-(\gamma)^{t}$ and 0 otherwise. We also added damping through $f=0.05$ in eq.12, which is necessary to avoid divergence of the algorithm before finding a solution. In practice, we constrained $h_{i\rightarrow\mu}$ to the interval $[-8,8]$ , so that if $h_{i\rightarrow\mu}>8$ (or $h_{i\rightarrow\mu}<-8$ ) by the first equation of eq.12, we set $h_{i\rightarrow\mu}=8$ (or $h_{i\rightarrow\mu}=-8$ ). We found that adding this constraint can improve the performance of the algorithm. Solving was stopped when a solution was found or after $T_{max}=4000$ time steps. In Fig.2c, $\gamma=0.99$ ; in Fig.2e, at a given pair of $A$ and $f_{in}$ values, we chose the minimal number of time steps that rBP used to solve among $\gamma=0.9$ , 0.99 and 0.999. We found that at the parameter values we chose ( $\alpha=0.2$ , $A=40$ and $N=480$ ), when $f_{in}=0.1$ , 0.2 and 0.3, rBP can successfully solve the perceptron with high probability for all three $\gamma$ values, and it takes the fewest time steps to solve when $\gamma=0.9$ ; when $f_{in}=0.4$ , however, rBP has a low probability of solving the problem if $\gamma=0.9$ , and it takes fewer time steps to solve when $\gamma=0.99$ than when $\gamma=0.999$ (Supplementary Fig.3). Because of this, in Fig.3 and Fig.4, $\gamma=0.9$ when $f_{in}=0.1$ , 0.2 and 0.3, and $\gamma=0.99$ when $f_{in}=0.4$ . When studying the cases $N=10000$ , 5000 or 2500 in Fig.8 and Supplementary Fig.9, we kept $\gamma=0.99$ , $f=0.002$ . Lookup tables were used to efficiently evaluate the H function in eq.12 during computation.

Appendix B Belief-propagation-guided decimation

The idea of belief-propagation-guided decimation (BPD) is to use belief propagation (BP) to evaluate the marginal probabilities of unfixed weights, and then fix the most polarized weight at its preferred value; then run BP again to fix another weight, until all weights are fixed. At a step of BPD, $\tau^{\mu}=\Theta(\sum_{i}w_{i}\xi_{i}^{\mu}-N/A)$ may become fixed at 1 or 0, whatever the values of the unfixed weights. If $\tau^{\mu}=\sigma^{\mu}$ , then the $\mu$ th IO pair is successfully associated and eliminated from the factor graph; if $\tau^{\mu}\neq\sigma^{\mu}$ , however, it means that BPD fails to find a solution of the perceptron problem and is terminated.
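The decimation loop described above can be illustrated on a very small system. In the sketch below, the marginals are computed by exact enumeration of the remaining solutions rather than by BP, so it shows only the fix-the-most-polarized-weight skeleton and is not the BPD implementation used in the paper. The function name `decimate_exact` and the convention that the output is 1 only when the total input strictly exceeds $N/A$ are assumptions.

```python
from itertools import product

import numpy as np


def decimate_exact(xi, sigma, A):
    """Decimation with exact marginals (enumeration stands in for BP).
    Returns a weight vector solving all IO pairs, or None on failure."""
    P, N = xi.shape
    theta = N / A
    fixed = {}  # index of a fixed weight -> value it is fixed at
    while len(fixed) < N:
        free = [i for i in range(N) if i not in fixed]
        # Enumerate all solutions compatible with the weights fixed so far.
        sols = []
        for bits in product((0, 1), repeat=len(free)):
            w = np.zeros(N, dtype=int)
            for i, v in fixed.items():
                w[i] = v
            w[free] = bits
            if np.array_equal((xi @ w - theta > 0).astype(int), sigma):
                sols.append(w)
        if not sols:
            return None  # some IO pair can no longer be associated
        # Marginal probability p(w_i = 1) within the current subspace.
        marg = np.mean(sols, axis=0)
        # Fix the most polarized unfixed weight at its preferred value.
        i_star = max(free, key=lambda i: abs(marg[i] - 0.5))
        fixed[i_star] = int(marg[i_star] > 0.5)
    return np.array([fixed[i] for i in range(N)])
```

With exact marginals, fixing a weight at its preferred value always leaves at least one solution, so this sketch cannot fail when a solution exists at the start. The failures discussed in the text arise because BP only estimates the marginals.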

BP equations are

[TABLE]

where as in eq.13, $s^{\mu}=2\sigma^{\mu}-1$ and $\theta=N/A$ . These BP equations run on a factor graph where unfixed weights (represented by $i,j$ ) and IO pairs (represented by $\mu,\nu$ ) are variable and factor nodes respectively, and there is a link between the $i$ th variable node and the $\mu$ th factor node if $\xi_{i}^{\mu}=1$ . $N_{\partial\mu}^{\text{fix}}$ in eq.17 means the number of fixed weights $w_{k}$ with $\xi_{k}^{\mu}=1$ . After getting a fixed point of BP, marginal probabilities of unfixed weights can be estimated as

[TABLE]

After introducing $h_{i\rightarrow\mu}=\log\frac{p_{i\rightarrow\mu}(1)}{p_{i\rightarrow\mu}(0)}$ and $\hat{h}_{\mu\rightarrow i}=\log\frac{\hat{p}_{\mu\rightarrow i}(1)}{\hat{p}_{\mu\rightarrow i}(0)}$ , and correspondingly $p_{i\rightarrow\mu}(1)=\frac{1}{1+e^{-h_{i\rightarrow\mu}}}$ and $\hat{p}_{\mu\rightarrow i}(1)=\frac{1}{1+e^{-\hat{h}_{\mu\rightarrow i}}}$ , BP equations can be simplified as

[TABLE]

When the number $|\partial\mu|$ of unfixed variable nodes around factor node $\mu$ is large, eq.20 can be efficiently calculated using Gaussian approximation:

[TABLE]

where $a_{\mu\rightarrow i}=\sum_{j\in\partial\mu\backslash i}\frac{1}{1+e^{-h_{j\rightarrow\mu}}}$ , $\sigma_{\mu\rightarrow i}^{2}=\sum_{j\in\partial\mu\backslash i}\frac{e^{-h_{j\rightarrow\mu}}}{(1+e^{-h_{j\rightarrow\mu}})^{2}}$ .
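The two cavity sums defined above can be obtained for all $i\in\partial\mu$ at once by subtracting each node's own term from the full sum. The sketch below covers only these two quantities, not eq.21 itself; the function name `cavity_moments` is illustrative.

```python
import numpy as np


def cavity_moments(h):
    """Given the messages h_{j->mu} for all j in the neighbourhood of mu,
    return arrays (a, sigma2) whose i-th entries are a_{mu->i} and
    sigma^2_{mu->i}, i.e. sums over all neighbours except i."""
    h = np.asarray(h, dtype=float)
    p = 1.0 / (1.0 + np.exp(-h))  # p_{j->mu}(1)
    var = p * (1.0 - p)           # equals e^{-h} / (1 + e^{-h})^2
    a = p.sum() - p               # leave-one-out mean of the input sum
    sigma2 = var.sum() - var      # leave-one-out variance of the input sum
    return a, sigma2
```

This costs $\mathcal{O}(|\partial\mu|)$ per factor node instead of $\mathcal{O}(|\partial\mu|^{2})$ for recomputing each leave-one-out sum separately.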

In practice, for factor nodes with $|\partial\mu|>6$ , we used the Gaussian approximation above, while for factor nodes with $|\partial\mu|\leq 6$ , we used exact enumeration to calculate eq.20. We found that this small-degree-enumeration strategy results in better convergence of BP than a pure Gaussian strategy. We also added damping to eq.19:

[TABLE]

We chose $f=0.02$ when $A=12$ ; $f=0.05$ when $A=40$ and $N=480$ ; and $f=0.002$ when $A=40$ and $N=10000$ , 5000 or 2500.

Numerically, in Fig.3 where $N=480$ , we initialized BP messages randomly at the beginning of each decimation step. We defined convergence of BP as the case when the change of BP entropy density $F_{BP}$ between adjacent iteration steps is smaller than $10^{-7}$ . In the case when BP failed to converge, we estimated the iteration step closest to the fixed point from the iteration dynamics of $F_{BP}$ using a number of heuristics, and calculated the marginal probabilities from the BP messages in that iteration step using eq.18. See Supplemental Material Section 4 for more details. In Fig.8 and Supplementary Fig.9 where $N=10000$ , 5000 or 2500, to speed up computation, we initialized BP messages as the BP messages obtained from the last decimation step if the last decimation step converged. We defined convergence as the case when $\frac{1}{N}\sum_{i=1}^{N}(h_{i}^{t+1}-h_{i}^{t})^{2}$ between adjacent iteration steps is smaller than $6.5\times 10^{-5}$ , where $h_{i}^{t}$ is the single-site BP field in the $t$ th iteration step. We found that these two strategies do not give significantly different results when $N=480$ .

Appendix C Evaluating cross-correlation using belief propagation

The cross-correlation between two binary weights $w_{i},w_{j}\in\{0,1\}$ can be calculated using

[TABLE]

where $p(w_{i}=1)$ and $p(w_{j}=1)$ can be evaluated using BP through eq.18. To evaluate $p(w_{j}=1|w_{i}=1)$ , we fixed $w_{i}=1$ , ran BP until it reached a fixed point, and calculated the marginal probabilities of the other weights.

In Fig.5b-d and Supplementary Fig.7, we evaluated mean cross-correlation and mean square cross-correlation using

[TABLE]

where $[\cdot]$ represents quenched average (i.e., average over IO pairs $\{\xi_{i}^{\mu},\sigma^{\mu}\}$ ), $\mathcal{A}$ is the set of values of $i$ , and $N_{\text{unfixed}}$ is the number of unfixed weights during BPD. When $N_{\text{unfixed}}\leq 50$ , $\mathcal{A}$ contains all the unfixed weights; when $N_{\text{unfixed}}>50$ , $\mathcal{A}$ contains 50 randomly chosen unfixed weights.

If BP for evaluating unconditioned marginal probabilities (i.e., $p(w_{i}=1),p(w_{j}=1)$ in eq.23) did not converge, we did not continue to evaluate conditional probabilities, and excluded this case from the quenched average. Sometimes, BP for unconditioned probabilities converged, but BP for conditional probabilities $p(w_{j}=1|w_{i}=1)$ did not converge. We found that this usually happened when $p(w_{i}=1)$ was small, so the ill-convergence of BP when we fixed $w_{i}=1$ may reflect the unlikelihood that $w_{i}$ takes 1, or in other words, $w_{i}=0$ in most solutions. Therefore, we set $c_{ij}=0$ for all $j\neq i$ in this case. We also tried excluding from $\mathcal{A}$ the indices $i$ whose fixation to 1 led to non-convergence of BP, and found that the results were not significantly different.

Appendix D Evaluating local entropy using belief propagation

We followed Ref.[40] to evaluate the local entropy around a given weight configuration $\tilde{\mathbf{w}}$ using belief propagation. The partition function we calculated is

[TABLE]

with $x$ being a coupling factor controlling the distance of $\mathbf{w}$ from $\tilde{\mathbf{w}}$ .

Appendix E Replica-method analysis

The replica methods we used in Fig.2a-c and Fig.4a,b closely follow the standard approaches presented in Ref. [36, 11, 20]. Here we only list the free entropy densities used to calculate the theoretical capacity, the local entropy around typical solutions, and the local entropy around configurations in the dense solution region. Details of how to introduce replicas to calculate the quenched average of free entropy density can be found in Ref. [36, 11, 20].

When calculating theoretical capacity, we introduce partition function

[TABLE]

with

[TABLE]

$Z(\xi,s)$ counts the number of solutions given a set of IO pairs $(\xi,s)$ . Using replica method, we can calculate free entropy density

[TABLE]

with $\langle\cdot\rangle_{\xi,s}$ indicating the quenched average over sets of IO pairs. The theoretical capacity plotted in Fig.2a-c is the $\alpha$ value at which $F=0$ . During the replica calculation, an order parameter $M$ is defined as $M=\frac{1}{\sqrt{N}}\sum_{i}w_{i}-\frac{\sqrt{N}}{Af_{in}}$ . This means that in a well trained perceptron, the average total synaptic current $\langle\mathbf{w}\cdot\xi^{\mu}\rangle_{\mu}=\sum_{i}w_{i}f_{in}$ deviates from the firing threshold $N/A$ by at most $\mathcal{O}(\sqrt{N})$ . In the classification task we considered, output $\sigma^{\mu}$ has equal probability to be 0 and 1. In this case, $M$ has saddle point value 0, so $\sum_{i}w_{i}f_{in}=N/A$ . The expression $\sqrt{f_{in}(1-f_{in})\sum_{i}(Aw_{i})^{2}}\approx\sqrt{AN(1-f_{in})}$ in Section 2 can be derived using the facts that $w_{i}\in\{0,1\}$ and $\sum_{i}Aw_{i}\approx N/f_{in}$ .
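For completeness, the last step can be written out explicitly. Since $w_{i}\in\{0,1\}$ implies $w_{i}^{2}=w_{i}$ , and the saddle point $M=0$ gives $\sum_{i}Aw_{i}\approx N/f_{in}$ , one obtains

$$
\sum_{i}(Aw_{i})^{2}=A^{2}\sum_{i}w_{i}^{2}=A\sum_{i}Aw_{i}\approx\frac{AN}{f_{in}},
\qquad
\sqrt{f_{in}(1-f_{in})\sum_{i}(Aw_{i})^{2}}\approx\sqrt{f_{in}(1-f_{in})\frac{AN}{f_{in}}}=\sqrt{AN(1-f_{in})}.
$$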

The local entropy around a typical solution plotted in Fig.4a is known as the Franz-Parisi potential at zero temperature. It is defined as

[TABLE]

where $\mathcal{N}_{\xi,s}(\tilde{\mathbf{w}},D)=\sum_{\mathbf{w}}\mathbb{X}_{\xi,s}(\mathbf{w})\delta[\frac{1}{N}\sum_{i}(w_{i}-\tilde{w}_{i})^{2}-D]$ means the number of solutions at distance $D=\frac{1}{N}\sum_{i}(w_{i}-\tilde{w}_{i})^{2}$ from reference configuration $\tilde{\mathbf{w}}$ . The meaning of $F_{FP}(D)$ is the mean free entropy density at distance $D$ from a solution.

The local entropy around a weight configuration in the dense solution region is calculated using large-deviation free entropy density

[TABLE]

In Fig.4b, we calculated the value of $F_{LD}(y,D)$ in the limit $y\rightarrow\infty$ under 1RSB ansatz, following the replica method introduced in Ref.[20].

Appendix F The geometry of pure states in full solution space

In this section, we discuss the geometry of pure states in the full solution space before weight fixing, based on the replica method introduced in Ref. [39].

The structure of solution pure states is closely related to 1RSB generalized free entropy density:

[TABLE]

where $Z_{\gamma}$ is the number of solutions in the $\gamma$ th pure state, and $x$ is the Parisi parameter. Following an argument similar to that of Ref. [39], it can be shown that $g(x)$ is a piecewise linear function:

[TABLE]

where $\phi_{\text{1RSB1}}(n,x)$ and $\phi_{\text{1RSB2}}(n,x)$ are two saddle-point solutions of the generating function $\phi(n)=\frac{1}{N}\log[Z^{n}]$ of partition function $Z$ under 1RSB ansatz, and $\phi_{\text{RS1}}(n)$ is a solution under RS ansatz, with the relation $\phi_{\text{1RSB1}}(n,x)=\phi_{\text{RS1}}(n/x)$ and $\phi_{\text{1RSB2}}(n,x)=\phi_{\text{RS1}}(n)$ . The overlaps $q_{\text{same}}$ and $q_{\text{diff}}$ in $\phi_{\text{1RSB1}}(n,x)$ and $\phi_{\text{1RSB2}}(n,x)$ respectively take the following values:

[TABLE]

where $q$ is the saddle-point value of overlap in RS solution $\phi_{\text{RS1}}(n)$ when $n\rightarrow 0$ .

Now let us discuss the implications of these results for the solution-space structure. Firstly, from eq.32, it can be shown [39] that the complexity function $\Sigma(s)$ (which is the entropy density of pure states that contain $e^{Ns}$ solutions) is a line connecting $(0,\phi_{\text{RS1}}^{\prime}(0))$ and $(\phi_{\text{RS1}}^{\prime}(0),0)$ (where we have denoted $\phi_{\text{RS1}}^{\prime}(0)=\frac{\mathrm{d}\phi_{\text{RS1}}(n)}{\mathrm{d}n}|_{n=0}$ ). As discussed in Ref. [39], this linear function may be a convex hull of a convex-downward function connecting these two points, whose convex-downward part between these two points is undetectable by mean-field theory. This suggests that the solution space is dominated by $\mathcal{O}(\exp(N\phi_{\text{RS1}}^{\prime}(0)))$ small pure states with a sub-exponential number of solutions in each, and by a sub-exponential number of large clusters with $\mathcal{O}(\exp(N\phi_{\text{RS1}}^{\prime}(0)))$ solutions in each. Additionally, the slope of the function $\Sigma(s)$ at $\Sigma=0$ is $-1$ , which means that the Parisi parameter in real systems is $x_{*}=1$ (see eq.9) when $N$ is large. Note that according to the results above, the numbers of solutions in large and small pure states are the same to the leading exponential order (i.e., $\mathcal{O}(\exp(N\phi_{\text{RS1}}^{\prime}(0)))$ ). However, the result $x_{*}=1$ means that the probability that two randomly chosen solutions lie in the same pure state is 0, which implies that the number of solutions in small pure states dominates over that in the large ones at sub-exponential order. Using exact enumeration, we show that $x_{*}$ tends to 1 when $N$ gets large (Fig.F1a), in support of the result $x_{*}=1$ in the large $N$ limit.

Now let’s discuss the overlaps $q_{\text{same}}$ and $q_{\text{diff}}$ . First, let’s investigate the pure states that dominate in the summation of eq.31:

[TABLE]

where $N_{\text{small}}$ and $N_{\text{large}}$ are the numbers of small and large pure states respectively, and $Z_{\text{small}}$ and $Z_{\text{large}}$ are the numbers of solutions in a small and a large pure state respectively. From our previous discussion, $N_{\text{small}}\sim Z_{\text{large}}\sim\mathcal{O}(\exp(N\phi_{\text{RS1}}^{\prime}(0)))$ , while both $Z_{\text{small}}$ and $N_{\text{large}}$ are of sub-exponential order. When $x<1$ , small pure states (i.e., the term $N_{\text{small}}Z_{\text{small}}^{x}$ ) dominate in $\sum_{\gamma}Z_{\gamma}^{x}$ . Because $g(x)$ is calculated from $\phi_{\text{1RSB1}}$ when $x<1$ (eq.32), the overlaps $q_{\text{same}}$ and $q_{\text{diff}}$ of $\phi_{\text{1RSB1}}$ (eq.33) should correspond to the structure of small clusters. In other words, two different solutions in the same small pure state have overlap $1/(f_{in}A)$ , and two solutions in different small pure states have overlap $q$ . Similarly, when $x>1$ , large pure states (i.e., the term $N_{\text{large}}Z_{\text{large}}^{x}$ ) dominate in $\sum_{\gamma}Z_{\gamma}^{x}$ , so $q_{\text{same}}$ and $q_{\text{diff}}$ of $\phi_{\text{1RSB2}}$ (eq.34) should correspond to the structure of large pure states. In other words, overlaps in the same large pure state or in different large pure states should both be $q$ .

At $x=1$ , which is the value $x_{*}$ that $x$ should take in “real” systems, both $\phi_{\text{1RSB1}}$ and $\phi_{\text{1RSB2}}$ predict $q_{\text{diff}}=q$ . There is, however, a discrepancy between the values of $q_{\text{same}}$ predicted by $\phi_{\text{1RSB1}}$ and $\phi_{\text{1RSB2}}$ at $x=1$ . To get a reasonable $q_{\text{same}}$ value, note that the number of solution pairs in the same small pure states is of $\mathcal{O}(e^{N\phi_{\text{RS1}}^{\prime}(0)})$ order, while the number of solution pairs in the same large pure states is of $\mathcal{O}(e^{2N\phi_{\text{RS1}}^{\prime}(0)})$ order. Therefore, $q_{\text{same}}$ at $x=1$ should be dominated by the solution pairs in large pure states, which means $q_{\text{same}}=q$ . Together, $q_{\text{same}}=q_{\text{diff}}=q$ . Using exact enumeration, we show that both $q_{\text{same}}$ and $q_{\text{diff}}$ stay around $q$ , and $q_{\text{same}}-q_{\text{diff}}$ tends to 0 when $N$ gets large (Fig.F1b,c), in support of the result $q_{\text{same}}=q_{\text{diff}}=q$ in the large $N$ limit.

Appendix G Miscellaneous

The method of enumeration was used both for listing out all the solutions of small-sized systems and for calculating eq.20 when $|\partial\mu|\leq 6$ . The basic idea to speed up enumeration is to sweep all possible weight configurations in an order so that adjacent configurations have minimal Hamming distance. In practice, configurations were swept in the order of Gray code [45].
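The Gray-code sweep described above can be sketched for the case of listing all solutions of a small system. Consecutive Gray-code configurations differ in exactly one weight, so the total input to each IO pair can be updated with a single column of $\xi$ instead of being recomputed. The function name `enumerate_solutions_gray` and the convention that the output is 1 only when the total input strictly exceeds $N/A$ are assumptions of this sketch.

```python
import numpy as np


def enumerate_solutions_gray(xi, sigma, A):
    """List all binary weight vectors that associate every IO pair,
    visiting the 2^N configurations in Gray-code order."""
    P, N = xi.shape
    theta = N / A
    w = np.zeros(N, dtype=int)
    field = np.zeros(P, dtype=int)  # xi @ w, maintained incrementally
    sols = []
    for k in range(2 ** N):
        if k > 0:
            # Going from Gray code k-1 to Gray code k flips exactly one
            # bit: the lowest set bit of k.
            i = (k & -k).bit_length() - 1
            w[i] ^= 1
            field = field + (xi[:, i] if w[i] else -xi[:, i])
        if np.array_equal((field - theta > 0).astype(int), sigma):
            sols.append(w.copy())
    return sols
```

Each step costs $\mathcal{O}(P)$ operations for the field update, compared with $\mathcal{O}(PN)$ for recomputing $\xi\mathbf{w}$ from scratch.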

The random walk in Fig.3 and Fig.4 was done as follows. Starting from a solution, we tried to flip a chosen weight, and accepted this flip if the configuration after flipping was still a solution. Weights were chosen in random sequential order. In Fig.3 and Fig.4, we started random walk from a solution found by SBPI or rBP, and swept all weights 10000 times.
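A minimal sketch of this random walk is given below. It assumes that "random sequential order" means a fresh random permutation of the weights in each sweep, and that the output is 1 only when the total input strictly exceeds $N/A$ ; the function name `random_walk` is illustrative.

```python
import numpy as np


def random_walk(xi, sigma, A, w0, n_sweeps, seed=0):
    """Random walk inside the solution space, starting from solution w0.
    A single-weight flip is accepted only if the result is still a solution."""
    rng = np.random.default_rng(seed)
    P, N = xi.shape
    theta = N / A
    w = np.array(w0, dtype=int)
    field = xi @ w  # total input of each IO pair
    for _ in range(n_sweeps):
        # Assumption: weights are visited in a new random order each sweep.
        for i in rng.permutation(N):
            delta = 1 - 2 * w[i]  # +1 flips 0 -> 1, -1 flips 1 -> 0
            new_field = field + delta * xi[:, i]
            if np.array_equal((new_field - theta > 0).astype(int), sigma):
                w[i] += delta
                field = new_field
    return w
```

In the setting of the text, `w0` would be a solution found by SBPI or rBP and `n_sweeps` would be 10000.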

Note the choice of parameter $A$ in numeric experiments. In our model, $A$ does not scale with $N$ when $N\rightarrow\infty$ . Therefore, if we set $A$ to a large value, comparable to or larger than $N$ , how well the numeric results can represent the large $N$ case is questionable. In all the figures, we chose $A=40$ , or 12 when studying systems of size $N\geq 480$ , or $N=25$ , so that $A$ is smaller than $N$ in all cases.

Multi-dimensional scaling in Fig.9e was performed using mdscale routine of MATLAB, with Sammon’s mapping criterion.

Bibliography


[1] D. H. O’Connor, G. M. Wittenberg, and S. S.-H. Wang, “Graded bidirectional synaptic plasticity is composed of switch-like unitary events,” Proc. Natl. Acad. Sci. U. S. A., vol. 102, no. 27, pp. 9679–9684, 2005.
[2] J. M. Montgomery and D. V. Madison, “Discrete synaptic states define a major mechanism of synapse plasticity,” Trends Neurosci., vol. 27, no. 12, pp. 744–750, 2004.
[3] J. Misra and I. Saha, “Artificial neural networks in hardware: A survey of two decades of progress,” Neurocomputing, vol. 74, pp. 239–255, 2010.
[4] R. L. Rivest and A. L. Blum, “Training a 3-node neural network is NP-complete,” Neural Networks, vol. 5, pp. 117–127, 1992.
[5] E. Amaldi, “On the complexity of training perceptrons,” in Artificial Neural Networks (O. Simula, T. Kohonen, K. Makisara, and J. Kangas, eds.), pp. 55–60, Elsevier, 1991.
[6] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” J. Mach. Learn. Res., vol. 18, pp. 1–30, 2018.
[7] J. Ott, Z. Lin, Y. Zhang, S.-C. Liu, and Y. Bengio, “Recurrent neural networks with limited numerical precision,” arXiv:1608.06902, 2017.
[8] A. Engel and C. V. den Broeck, Statistical Mechanics of Learning. Cambridge, England: Cambridge University Press, 2001.
