Autonomous Probabilistic Coprocessing with Petaflips per Second
Brian Sutton, Rafatul Faria, Lakshmi A. Ghantasala, Risi Jaiswal,, Kerem Y. Camsari, Supriyo Datta

TL;DR
This paper proposes a novel autonomous probabilistic computer architecture based on a network of p-bits, capable of ultrafast operation without sequencers, and projects petaflips per second scalability with hardware benchmarks.
Contribution
It introduces a sequencerless design for probabilistic computers using p-bits, enabling ultrafast, autonomous operation and hardware-agnostic performance metrics.
Findings
Demonstrates ultrafast autonomous p-bit operation
Projects petaflips per second scalability with hardware benchmarks
Proposes flips per second as a universal hardware performance metric
Abstract
In this paper we present a concrete design for a probabilistic (p-) computer based on a network of p-bits, robust classical entities fluctuating between -1 and +1, with probabilities that are controlled through an input constructed from the outputs of other p-bits. The architecture of this probabilistic computer is similar to a stochastic neural network with the p-bit playing the role of a binary stochastic neuron, but with one key difference: there is no sequencer used to enforce an ordering of p-bit updates, as is typically required. Instead, we explore \textit{sequencerless} designs where all p-bits are allowed to flip autonomously and demonstrate that such designs can allow ultrafast operation unconstrained by available clock speeds without compromising the solution's fidelity. Based on experimental results from a hardware benchmark of the autonomous design and benchmarked device…
| Sequenced | Autonomous (This Work) | ||||
| Hitachia | Janus IIb | 2K QAc | 8K Isingc | Projectedd | |
| Technology | CMOS (SRAM) | CMOS (FPGA) | CMOS (FPGA) | CMOS (FPGA) | CMOS + MTJ |
| Total Power (W) | 0.05 | 25 | 55 | 32 | 19.25 |
| Number of Neurons (N) | 20K () | 2K | 2K () | 8.1K () | 1M |
| 1/12 | 1/4 | 1/10 | |||
| Synapse Delay () (ps) | 10,000 | 4,000 | 8,000 | 8,000 | 10 |
| Neuron Delay () (ps) | 10,000 | 4,000 | 96,000 | 32,000 | 100 |
| Flips per Second () | |||||
| Spin Update Time () (ps/flip) | 1 | 4 | 48 | 4 | 0.0001 |
| Energy per Flip () (nJ/flip) | 0.1 | 2.64 | 0.13 | ||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Autonomous Probabilistic Coprocessing with Petaflips per Second
Brian Sutton, Rafatul Faria, Lakshmi A. Ghantasala, Risi Jaiswal, Kerem Y. Camsari, and Supriyo Datta BS, RF, LAG, RJ and SD are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, 47907, USA, KYC is with the Department of Electrical and Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA, 93106, USA. This work was supported in part by ASCENT, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA and in part by the Center for Probabilistic Spin Logic for Low-Energy Boolean and Non-Boolean Computing (CAPSL), one of the Nanoelectronic Computing Research (nCORE) Centers, a Semiconductor Research Corporation (SRC) program sponsored by the NSF.
Abstract
In this paper we present a concrete design for a probabilistic (p-) computer based on a network of p-bits, robust classical entities fluctuating between -1 and +1, with probabilities that are controlled through an input constructed from the outputs of other p-bits. The architecture of this probabilistic computer is similar to a stochastic neural network with the p-bit playing the role of a binary stochastic neuron, but with one key difference: there is no sequencer used to enforce an ordering of p-bit updates, as is typically required. Instead, we explore sequencerless designs where all p-bits are allowed to flip autonomously and demonstrate that such designs can allow ultrafast operation unconstrained by available clock speeds without compromising the solution’s fidelity. Based on experimental results from a hardware benchmark of the autonomous design and benchmarked device models, we project that a nanomagnetic implementation can scale to achieve petaflips per second with millions of neurons. A key contribution of this paper is the focus on a hardware metric flips per second as a problem and substrate-independent figure-of-merit for an emerging class of hardware annealers known as Ising Machines. Much like the shrinking feature sizes of transistors that have continually driven Moore’s Law, we believe that flips per second can be continually improved in later technology generations of a wide class of probabilistic, domain specific hardware.
I Introduction
Stochastic artificial neural networks (ANN) have broad utility in optimization and machine learning (ML) tasks such as inference and learning[1]. Even though stochastic ANNs are relatively rare in modern ML architectures [1], stochasticity is often viewed as a useful resource [2, 3, 4]. A class of emerging hardware accelerators, known as Ising Machines, typically have stochastic neural network representations. Ising Machines designed to solve hard problems in combinatorial optimization continue to emerge using a wide-range of underlying technologies. Solvers for such problems have been explored using quantum effects, optical approaches, digital logic, and magnetic technologies [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]. In general, these systems map a given optimization problem onto a hardware whose operation is guided by a cost function[19, 20].
A common version of such stochastic neural networks is based on the concept of a binary stochastic neuron (BSN) [21, 22] which fluctuates between -1 and +1 with probabilities that can be controlled through an input, , constructed from the outputs of other BSNs, . The synaptic function, , can have many different forms depending on the desired functionality, but we will restrict our discussion to linear functions defined by a set of weights such that
[TABLE]
where is a constant and is the ‘synapse time’, that is the time it takes to recompute the inputs every time the outputs change. In software implementations, each BSN is updated repeatedly according to
[TABLE]
where represents a random number in the range , and is the ‘neuron’ time, that is the time it takes for a neuron to provide stochastic output with the correct statistics dictated by a new input .
It is well-known [23] that to ensure fidelity of operation it is important to avoid simultaneous updates of two BSNs that are connected through a non-zero . The standard approach to avoid this issue is to update each BSN sequentially according to Eq. (2), recomputing the input from Eq. (1) after each update, a procedure known as Gibbs sampling[24]. With sequential updating, the p-bit update rate, given as the number of flips per second (), is limited by the clock speed of a given implementation. While it is possible to optimize such an approach with high-speed hardware, the sequential nature of the update limits is limiting. It is possible to enable parallel updates for groups of neurons if an encoded problem can be partitioned into decoupled groups of neurons sequenced to update independently. However, this partitioning is still constrained by the need to avoid simultaneous updates while also requiring problem specific analysis to determine the neuron update groupings. Existing hardware approaches have used problem specific partitioning to improve and we will compare our approached with these implementations in section V.
In contrast with sequenced BSN update approaches, the objective of this paper is to explore the feasibility of ultrafast operation through an autonomous architecture whereby each BSN continually fluctuates between -1 and +1 with probabilities that are controlled by the input . We refer to this autonomous BSN as a p-bit [25] to highlight its role as the key element of an autonomous p-computer (ApC), similar to the role of a q-bit in a quantum computer. We note that such an autonomous architecture in the absence of any clocking circuitry that controls the updating p-bits has recently been demonstrated in small scale using 8 magnetic tunnel junction based p-bits[26]. With this experimental demonstration of an 8 p-bit design, it is important to understand if such a system can scale effectively.
Herein we use FPGA emulation to demonstrate the operation of a scaled version of such an autonomous computer up to 8100 (9090) p-bits with all the necessary peripheral circuitry, including programmable synapses that are used to map different problems to the co-processor. The FPGA implementation presented in this paper is specifically designed to capture the autonomous operation of probabilistic bits that fluctuate in time allowing us to make performance projections of a scaled implementation of the demonstration presented in Ref. [26]. This paper demonstrates the feasibility of an ApC that performs the weight logic and p-bit functions defined by Eqs. (1) and (2) without the aid of sequencers as portrayed in Figs. 1 and 1. Our work is motivated by the compact, fast, energy efficient hardware that are currently being developed for the implementation of these functions [27] as shown in Fig. 1, which we emulate using existing CMOS devices on an easily reconfigurable, cloud accessible digital FPGA platform, Fig. 1.
In the following sections, we will present an emulation framework for the study of scaled autonomous probabilistic coprocessing and use the framework to provide performance predictions of a design based on nanomagnets, helping to motivate such an implementation. Section II will provide an overview of the ApC, our analysis methodology, and present an abstract autonomous p-bit model with corresponding device benchmarking. Section III provides an overview of the design and implementation of a p-computing coprocessing framework using a cloud-accessible FPGA platform. Section IV presents results for two distinct applications using the coprocessor, one involving combinatorial optimization and one involving emulated quantum annealing. Finally, sections V and VI discuss how these applications help show that accurate results can be obtained with a sequencerless probabilistic computer of the type envisioned by Feynman [28], implemented using modern devices to enable operation at ultrafast rates unconstrained by the available clock speed in a sequenced design.
II Autonomous p-Computing Model
The building block for our ApC has four components as shown in Fig. 1:
- •
weight logic to implement Eq. (1),
- •
p-bit to implement Eq. (5) described below,
- •
write unit to program the weights and
- •
read unit to access the individual p-bit outputs
Fig. 1 shows how multiple building blocks can be interconnected to form a p-computer. The tiling shown is based on nearest-neighbor connections, but the connections need not be limited to nearest-neighbor. We have also implemented all-to-all networks using the digital emulator shown in Fig. 1, as discussed in section III. In the following sub-sections, we will introduce what is meant by “autonomous” operation, present a digital model for such an autonomous p-bit, and finally benchmark the digital model against a physical device model.
II-A Autonomous p-bit Operation
Superficially Fig. 1 looks like other existing neural network architectures, like the one used for TrueNorth [29]. However, to our knowledge, earlier implementations have used time-multiplexing [29, 30] to share the same resource among different neurons and synapses, while our objective is to eliminate sequencing and time-multiplexing altogether so that we are not constrained by available clock speeds. TrueNorth for example uses 4096 neurosynaptic cores, each core having dedicated neuron and synaptic memory forming 256 logical neurons that are time multiplexed sequentially to implement M logical neurons [29]. For our sequencerless operation we would need 1M distinct building blocks for the same number of neurons. This would be impractical if we were relying on fully digital implementations, however the compact hardware implementations currently being developed makes such a design feasible. As an example, stochastic behavior of nanomagnets has recently attracted attention in the context of novel computing paradigms, and they show promise in probabilistic and neuromorphic applications [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]. For example, an MRAM-based p-bit requires only 3 transistors and a magnetic tunnel junction [46], while its digital emulation requires significantly more transistors. With such compact hardware, it is feasible to have one building block for every p-bit in order to support sequencerless operation that is not limited by clock speeds.
We call this sequencerless operation of ANNs “autonomous” to distinguish it from the asynchronous operation that is widely used in the context of Spiking Neural Networks [29, 30].
As a quantitative measure of an ApC’s speed of operation we use the number of flips per second (f), a flip being defined as a p-bit update attempt (i.e. it may choose not to actually flip). For purely sequential updating, the number of flips per second is . However, as mentioned earlier, updating need not be purely sequential since unconnected BSNs can be simultaneously updated without loss of fidelity. If a number () out of the total number () of BSNs can be updated in parallel, then the number of flips per second will be much larger . Note, however, that in order to achieve this enhanced flip rate, the number of neurons that are simultaneously updated, , have to be deliberately selected using a digital sequencer, so that the clock period, , limits the maximum number of flips per second:
[TABLE]
The objective of this paper is to present a framework for clockless operation [23] whose speed is limited only by the neuron and synapse speeds
[TABLE]
where . We will show that this autonomous mode provides high fidelity results without supervision keeping the fraction of detrimental simultaneous updates down to an acceptably low. Detrimental updates are managed simply by choosing a small so that is much longer than , without using a digital sequencer to enforce a deliberate update order. As a result, is not limited by and can continue to operate faster as the synapse time is lowered. For example if we have nearest neighbor connections with weights of , then the synapse can be implemented with short wires which respond in times less than 10 to 100 ps [48], much shorter than typical clock periods.
Eqs. (3a) and (3b) suggest that an autonomous design will allow faster operation (that is, more flips per second) if
[TABLE]
The factor represents the fraction of neurons that can be updated simultaneously, which depends on the fan-in, the nature of interconnections, and problem topology. For example, with nearest neighbor connections on a 2D square lattice as in Fig. 1, half the nodes can be updated simultaneously so that . The factor also depends on the interconnections, but it additionally depends on the nature of the problem and the degree of solution fidelity needed, as we will show in this paper.
Ideally, we would implement a scaled version of an ApC using nanomagnetic devices to explore the performance of such a design. However, given current technological limitations, we will model the operation of p-bit hardware using a digital FPGA platform. In the following section we discuss how we use a digital, synchronous device to emulate intrinsically asynchronous nanomagnets.
II-B Autonomous p-bit Model
As digital platforms are inherently synchronous, we mimic autonomous operation by replacing Eq. (2) with a new hardware-inspired model, Eq. (5) below that we benchmarked against established device models (section II-C). These equations are based on SPICE simulations of Boltzmann networks where the update order of p-bits becomes irrelevant due to the symmetric coupling between connected p-bits. Such a clockless circuit corresponds to the asynchronous parallelism scheme used to realize Boltzmann Machines in hardware with no asymptotic guarantees for convergence [23] unless all p-bits operate with up-to-date information that is enabled by fast synapses [49]. This model is valid for such networks, however, for other networks such as those with directed connections, the update ordering of p-bits may be important and other hardware models more appropriate for these systems are likely required and are not discussed in this paper.
At each time step, all p-bits are free to flip and they do so with a probability that is controlled by the input having a zero-input value .
[TABLE]
As each p-bit flips with a probability in each time step, the average time taken for a p-bit to respond is . Since time steps are measured in units of , we have as stated earlier. Unlike Eq. (2), Eq. (5) can be used to update all p-bits in parallel without explicitly worrying about simultaneous updates. With small values of , the fraction of simultaneous updates is sufficiently small such that Eq. (5) in an unsequenced mode gives results equivalent to those obtained from careful sequencing using Eq. (2).
For a given network topology and embedded problem[50, 51], the value of that ensures convergence to thermal equilibrium must be identified in order to assess the amount of parallelism in the design. The desire is to find the smallest value of that converges the thermal equilibrium for the problem specific Boltzmann distribution. In section IV we explore two example problems and evaluate the degree of parallelism obtainable. For example, in the nearest-neighbor design of Fig. 7, a value of ensures convergence for a network of 8100 neurons. This value of states that on average the number of neurons that update within each synapse delay is . However, as the problem complexity increases, i.e. the incorporation of on-site biases, the acceptable value of decreased to as shown in Fig. 8. Ultimately, the term drives the physical design of the synapse and neuron implementation.
II-C Model Benchmarking with Stochastic LLG
The autonomous p-bit model of (5a) and (5b) was benchmarked against a coupled stochastic Landau-Lifshitz-Gilbert (sLLG) equation. Magnetization dynamics of the modeled circular stochastic nanomagnet are captured by solving the sLLG equation [52]:
[TABLE]
where is the damping coefficient, is the electron gyromagnetic ratio, Vol./ is the total number of Bohr magnetons in the magnet, is the saturation magnetization, is the effective field including the out-of-plane ( directed) demagnetization field , as well as the thermally fluctuating magnetic field due to the three dimensional uncorrelated thermal noise with zero mean and standard deviation along each direction, is the applied spin current to the nanomagnet. The HSPICE solver we employed is based on spherical coordinates that solves for (,), but the noise is first included as three uncorrelated random magnetic fields in Cartesian coordinates before being turned into spherical coordinates [53]. The HSPICE solver uses the .trannoise function[53] and is benchmarked against our own MATLAB implementation that uses the Stratonovich convention [54], as well as the Fokker-Planck Equation [52, 53]. We have used a time step of 1 ps for the chosen parameters which is verified by comparing the equilibrium fluctuations of single nanomagnets that are obtained numerically, against the expected Boltzmann distribution.
Normally, a nanomagnet based p-bit thresholds its continuous output with an inverter [25, 55] that is typically included in SPICE simulations with additional transistors. In order to simplify numerical simulations in the present context, we artificially threshold the sLLG outputs (such that and ) at each time step. This allows a binarization of the sLLG outputs that allow each p-bit to have a binary output, according to Eq. 2. Note that in an actual device implementation, a single inverter is enough to threshold the output given that a moderate tunneling magnetoresistance value (TMR=) leads to voltage fluctuations of mV’s over a voltage division of V, which is more than enough for a single inverter to threshold these fluctuations to rail-to-rail voltages [55]. For a detailed analysis of the nanomagnet based p-bit design that includes device-to-device variations, see Ref [26].
Individual p-bits are coupled according to:
[TABLE]
where, is the fitting parameter of the sigmoidal response ( versus spin current along -direction). Equation 7 and Equation 5 constitute the autonomous behavioral model that is used to benchmark the results of coupled sLLG equations, which we refer to as “behavioral model” hereon. In the benchmark, a circular disk magnet with a vanishing anisotropy () is used with the parameters: diameter nm and thickness nm, , emu/cc, Oe resulting in an autocorrelation time of ns and mA. A fitting parameter of is used in the behavioral model for , i.e. .
We use a simulated Sherrington-Kirkpatrick [56] spin glass network with a random coupling matrix and random bias between -1 and +1. The benchmarking of the proposed behavioral model with the coupled sLLG network, analogous to the probabilistic circuit proposed in [57], is accomplished by comparing two different quantities: (1) Euclidean distance and (2) Free energy.
Euclidean distance is defined by:
[TABLE]
where is the probability of occurrence of the -th configuration computed out of 4000 ensembles at each time step of the simulation. is computed from the joint probability distribution obtained from a Boltzmann law.
The second benchmark approach is based on a comparison of the free energy of the system with what is expected from the principles of statistical mechanics. Free energy is defined by [58] the partition function :
[TABLE]
where, is the pseudo-inverse temperature. Partition function is given by:
[TABLE]
where represents different configurations of the network. Energy of a specific configuration is defined by:
[TABLE]
When numerically calculating free energy from the sLLG data, the following steps have been applied (similar to the importance sampling method described in [59]):
The probability of different configurations, , are calculated out of 4000 ensembles for each time step 2. 2.
For each larger than a certain threshold value , the partition function is calculated, so that outliers are excluded 3. 3.
For each , the free energy is calculated. 4. 4.
Finally the mean of all is computed.
The above method is suitable for small examples, but may not scale due to the difficulty in empirically calculating different probabilities as the network size grows. The striking agreement between the sLLG model and the behavioral model given by Eq. (5) shown in Fig. 2 and Fig. 3 helps establish the validity of Eq. 5 as a suitable digital model of an autonomous, stochastic MTJ-based computer.
III Emulation Framework
Having established a model for the autonomous p-bit operation, in this section we describe the design and implementation of an FPGA based framework to explore the performance, scalability, and other characteristics of an ApC.
III-A Autonomous FPGA Coprocessor
An all-digital framework based on Eqs. (5a) and (5b) was developed, Fig. 4, to facilitate architectural exploration of an ApC, study various trade-offs in p-bit, weight logic, and topology design, and to accelerate the combinatorial optimization and sampling problems explored in section IV. The digital framework leverages reconfigurable computing devices to support rapid exploration of different designs. A Xilinx Virtex Ultrascale xcvu9p-flgb2104-2-i provided via Amazon Web Services F1 cloud-accessible EC2 compute instances was used for the ApC implementation. While a Xilinx FPGA was used in this work, the design is hardware agnostic and another device, e.g. an Intel Stratix FPGA, could readily be used.
As shown in Fig. 1 and Fig. 4, an ApC comprises multiple weighted p-bits arranged in various topologies, each supporting programmable problem instances. There are many options for the implementation of programmable control, weight logic, p-bits, and p-bit readout in a digital platform. The digital ApC of Fig. 4 comprises a modular weighted p-bit, Fig. 4, that can be organized into various topologies, Fig. 4, supporting programmed problem instances. An example weighted p-bit implementation is shown in Fig. 1 leveraging a memory-mapped weight-logic register bank supporting linear weight coupling. The output of the weight logic block is provided to a programmable, activation function look-up table that is used in conjunction with a Linear Feedback Shift Register (LFSR) based pseudo-random function to implement Eqs. (5a) and (5b). Additional options were developed and explored for these building block elements, see section III-B and Fig. 6, beyond what is shown in Fig. 1.
Interaction with the FPGA framework is provided through MATLAB MEX programs in a client-server command driven model, Fig. 4. Clients issue commands to select which of the pre-built topologies to program into the cloud FPGA instance, the current pseudo-temperature for the network, problem specific weights, and options to pause or resume the network. Commands are also provided to support random sampling from the network for readout operation. Online annealing is directly supported through global update operations of the activation function look-up. All weights are dynamically programmable through memory-mapped operations. The server interfaces with a PCI Express (PCIe) attached FPGA, interacting with the programmed design as commanded. Other clients such as Octave and Python are readily supported through a C++ abstraction layer leveraging networking and serialization libraries.
Fig. 5 depicts the logical organization of the FPGA ApC design used herein. All interaction with the FPGA is performed using a PCIe Gen3 x16 interface supporting direct memory access (DMA) transactions (requester). Programmable control of the design is accomplished using an AXI-Lite 32-bit completer interface and memory map decoder. The address space for the ApC is divided into a few regions: global control and information registers, p-bit readout array, and individual control for each weighted p-bit in the design.
The global address space provides information on the ApC pertinent for client interaction with the system. This includes information such as the total number of p-bits in the design, the p-bit network topology, weight precision, and other useful run-time information such as the number of elapsed synapse delays between client sampling requests. Additionally, global control functions include the ability to pause and resume the network so that a client can access the global p-bit readout array. This readout array holds the current output of all p-bits sampled on each system clock cycle. Reads from this interface are performed when the network is paused to ensure atomic readout of all p-bits given the limited bandwidth of the readout interface.
Each p-bit in the system has a local memory space supporting programmable control of its function. Each p-bit has localized programmable weights enabled through registers or internal memory (RAM), a programmable activation function lookup table supporting direct look-up or interpolation, support for seeding the chosen pseudo random number generation (PRNG) function, and finally a set of control registers for the p-bit. The output of the p-bit programmable elements directly interface with the weight logic, activation function, PRNG, and comparator operation of the weighted p-bits. Online annealing is accomplished using a global bus broadcast when programming the activation function lookup tables, so that all p-bits are updated simultaneously. Alternatively, each p-bit’s activation lookup table can be independently controlled, allowing the exploration of non-uniform bias, local temperature effects, and other non-idealities, facilitating future opportunities for exploration.
III-B Digital Building Blocks
A modular digital p-bit was designed to support different options for the p-bit building block elements. Each p-bit is logically partitioned into a unit for pseudo-randomness or “entropy”, a block for computing the activation function, and a portion that uses the results of a comparison between the activation function and PRNG to determine if a p-bit update attempt should occur. There are various ways to construct these elements using digital logic. In this design, a few select implementations were explored as shown in Fig. 6. A non-exhaustive exploration of design options described below was performed as part of initial trade-study. As quantitative results were not pursued, a qualitative assessment of the various options is provided for completeness.
Fig 6 and 6 show two methods for performing autonomous updates of each p-bit. Shown in Fig. 6 is the logic corresponding to Eq. (5a) where the comparator provides . This approach emulates autonomy while preserving fully synchronous operation of the digital design. This p-bit update logic was used for the problems in this work. The activation look-up is programmed with .
A single clock domain within the FPGA is used to synchronize the digital elements of each synapse and p-bit. For each global clock period, , the synapse logic requires time to compute where is the number of global clock events required. In the nearest-neighbor models, it is possible for to need only a single cycle, however, as the the weight precision and number of neighbors is increased, also increases. Thus, when specifying an ratio, coupled with a design driven , the neuron time, encoded probabilistically in the look-up table, is .
Alternative approaches were explored and implemented for the p-bit updates including the use of free-running ring oscillators, Fig. 6, to mimic the naturally stochastic update frequency of an MTJ based p-bit as described by Eq. (2). In this implementation, each p-bit has a dedicated free running ring oscillator that generates asynchronous “attempt” edges that are synchronized into a system clock domain. Each asynchronous edge determines when the p-bit should attempt to update based on the current output from a comparator according to Eq. (2), in which case the activation look-up is programmed with . While the use of oscillators provides some degree of true randomness for when flip attempts occur, this benefit was marginal and did not out weigh the design complexity and overhead (e.g. area and power) introduced with their use. As a result, the design of Fig. 6 was selected for the attempt logic.
While the attempt logic is used to determine when an update attempt should be made, the logic relies on the output of a comparator to determine if a flip should occur. The comparator computes the sign of the difference between the output of a PRNG and the output of the programmed activation function. Shown in the second row of Fig. 6 are two PRNG implementations. A Linear Feedback Shift Register (LFSR) is a PRNG that provides pseudo randomness in a compact design at the expense of output quality, Fig. 6. While the LFSR may be sufficient for many problems, a higher-quality PRNG was implemented that requires minimal FPGA resources, but significantly improves the PRNG quality [60].
The second input to the comparator is from an activation function output, shown on the third row if Fig. 6. As implemented, a straight-forward look-up table was sufficient for the designs in this paper. However, an improved interpolation based activation function[61] would improve the accuracy of the lookup results and may be necessary for certain problem classes.
An additional building block was created to leverage physical randomness and a built-in sigmoidal response from within a digital design as shown in Fig. 6. Leveraging a delay based building block [62] and flip-flop metastability, “true” entropy was used to construct a sigmoidal response as depicted in Fig. 6. However, over continued operation within the device, temperature and other variations caused the sigmoidal curves to drift, resulting in non-uniform bias and operation. As a result, this building block is not currently being used in the design; however, it does provide insight into non-idealities that may be encountered in a chip design.
Finally, we note that an ApC that emulates the physics of different hardware primitives including memristive [63] stochastic neurons could be implemented in an autonomous circuit, unlike the nanomagnet inspired equations Eq. 5a-Eq. 5b. Additionally, as discussed earlier, we do not explore directed networks in this work, however the overall FPGA architecture was designed to support the exploration of different hardware models and network topologies, hence they can be included here with minimal effort in the future.
IV Applications and Validation
In this section we explore two applications of the ApC using the FPGA framework: combinatorial optimization and quantum emulation.
IV-A Combinatorial Optimization
A common architecture used to solve combinatorial optimization problems is the nearest-neighbor Ising model as shown in Fig. 7. In the Ising model, each p-bit is connected to its neighbor through a coupling matrix, , and is influenced by an on-site bias . Note that this is trivially mapped into as in Eq. (1). By mapping a problem of interest to this system, an Ising computer will intrinsically search for the lowest energy solution to the problem.
Typical implementations of these systems leverage some form of simulated annealing in hardware to guide the system into a low energy solution, using careful control and sequencing of spin-flip updates to avoid non-ideal spin updates [64, 6, 65]. In the context of a nearest-neighbor topology, the network can be split into two groups in a checkerboard-like pattern such that all spins in a given group can be updated in parallel [66]. This results in the ability to update half of all spins within a given clock period. According to Eq. (3a) this results in
[TABLE]
An ApC can be used to solve the same class of problems, but to do so a value of must be found that produces a solution with the desired fidelity. By identifying this limit for , an upper bound is placed on the synapse speed that must be obtained to achieve a competitive from Eq. (4).
Shown in Fig. 7 is a Max-Cut problem for which a black and white image was used to encode magnetic domains for the p-bits. The network is initialized to run at a high effective-temperature (low ) resulting in an effectively uncoupled network. The temperature is then gradually reduced according to an annealing scheduling until the network crystallizes at a low energy, in this case ideal, solution as shown in Fig. 7. Shown in Figs. 7, 7, 7, different values of are used to convey how simultaneous updating affects the ability of the network to converge to the solution. As is decreased from to and , the smaller values of result in a more effective convergence. While these results are heuristic in nature given their visual display, a quantitative energy based analysis conveys the same relationship as discussed in the following section with a demonstration of simulated quantum annealing.
For this problem, a value of resulted in effective convergence to the ideal solution and well-behaved network operation. Given this result, the ApC has
[TABLE]
Comparing (12) and (13), as long as the synapse is twice as fast as the best that could be obtained from a synchronous implementation, the network will perform the same number of flips per second without needing a sequencer.
IV-B Quantum Emulation
Simulating the behavior of quantum systems using classical models has long attracted interest. After its introduction by Ref. [67], it has been recognized that thermodynamic features of quantum systems that avoid the “sign problem” [68] can be efficiently simulated by classical computers. This allows Quantum Annealing (QA) algorithms designed for quantum systems to be simulated on classical computers, an approach that is called Simulated Quantum Annealing (SQA). Computationally, SQA uses a finite number of “replicas” where each replica is sized to match the size of the original quantum system. This replication enables a mapping of the quantum system to a classical collection of p-bits. The number of replicas that are needed depends on the temperature of the quantum system [69, 70] and the desired accuracy to emulate the quantum system. SQA algorithms are typically run on software [69, 70, 71] and on sequencer-based hardware designs [72, 73]. Here, we demonstrate how quantum systems can be emulated using our ApC by solving a model quantum system, the Transverse Ising Hamiltonian, and establish how ApC exactly reproduces the thermodynamics of a many-body quantum system at finite temperatures and magnetic fields.
Recently, it was theoretically argued [74] that a network of p-bits described by Eq. (1) and interconnected according to Eq. (2) can be used to emulate a quantum system with a finite number of classical replicas, where a p-bit is represented in hardware by the stochastic MTJ-based implementation of Fig. 1.
We start from the nearest neighbor Transverse Ising Hamiltonian in 1D [75]:
[TABLE]
where represents the interaction between neighboring spins, and are local magnetic fields in the and directions, is the Pauli spin operator, and is the number of spins in the chain. After a Suzuki-Trotter mapping, the 2D classical Hamiltonian that approximates this system with replicas is given as:
[TABLE]
where the parallel coupling is , and the vertical coupling term is and represent the classical spins in the 2D lattice.
Eq. (15) can be used to find each diagonal element of the quantum density matrix and therefore, all diagonal operators, and their correlations can be calculated from it. The corresponding interaction matrix () to perform a p-bit simulation can be calculated from the mapped classical Hamiltonian, such that to yield the weight coefficients. Note that the Suzuki-Trotter decomposition adds another dimension to the classical system, therefore a 1D linear chain for a quantum system is emulated by a 2D classical system.
In Fig. 8, we simulate a ferromagnetic linear chain () using spins with periodic boundary conditions at different transverse magnetic fields and we compute the average magnetization, . We include a symmetry breaking field in the z direction ( to obtain a net magnetization at vanishing transverse fields (. We obtain the exact density matrix by directly diagonalizing the quantum Hamiltonian (Eq. (15)): , where we chose an inverse temperature of . Once is known, we compute the average magnetization by tracing it with the magnetization operator , . The exact solution is shown as a black solid line in Fig. 8.
The corresponding average magnetization is obtained by running the classical system that is described by Eq. (15) with replicas (250 8 = spins) using Eq. (5) in the FPGA emulator. Fig. 8 shows different values that are used to obtain the exact result. Clearly, choosing too high () fails, but gradually decreasing allows the result to approach the exact result. We observe that seems to be an optimal choice, as decreasing further does not yield more improvement due to the chosen precision in the digital implementation. Specifically, as continues to reduce, the chosen precision of the arithmetic logic and look-up tables in the FPGA emulator design limits the accuracy of the solution. Fig. 8 quantitatively shows a boxplot of the error incurred at each value across 200 trials. In the the Suzuki-Trotter decomposition, increasing systematically increases the error but a reasonably close agreement is observed for .
Typically, SQA algorithms initialize the system at a high magnetic field ( and slowly remove it to keep the system in its ground state and guide it to the desired ground state of the Ising Hamiltonian. In Fig. 8, however, we have not changed the magnetic field as a function of time, but rather sampled from 200 ensembles, each separated by synapse delays, to obtain the system statistics. This means that not only was the system guided to the expected ground state (), but it also followed the correct average magnetization at high magnetic fields. As such, this could be viewed as an example of sampling a probability distribution rather than finding the ground state of the system [21], an important problem space where an ApC could be useful.
It is important to note that to solve optimization problems using SQA, an exact mapping of the quantum system may not be optimal and a finite number of replicas that only approximate the thermodynamics of the quantum system could lead to more efficient results. In the present context, optimizing the number of replicas for a given system can be arranged since our system is not a natural quantum system nor is it trying to faithfully reproduce the behavior of such a system. Therefore, optimizing the number of physical replicas or finding the replica with the lowest energy becomes possible in our engineered system [70].
The replica discretization for the present implementation of SQA incurs errors of order O() where is the inverse pseudo-temperature and is the number of replicas [70]. The explicit form of the error provides a guide to reduce errors at a fixed at the expense of adding more p-bits. As an example, note the increasing error as in Fig. 8 where increasing compared to a fixed J is like increasing . In both examples (Fig. 8-9) we discuss, we have made comparisons to exact calculations that did not require a careful optimization of the number of replicas since we were guided by the errors with respect to exact calculations.
As a second example, we illustrate the trade-off between the number of replicas and the size of the quantum system when limited to a fixed number of p-bits, for example in an FPGA with finite resources. The emulation in Fig. 8 modeled a 1D Transverse Field Ising Model (TFIM) with spins and with 250 replicas (with a total of 2000 p-bits) to obtain an accurate result at a low (dimensionless) temperature of . In Fig. 9, we show another example with a much larger system of spins but using only 10 replicas per quantum spin for a total of 2500 p-bits.
In this example, instead of plotting average magnetization, we plot the average correlation (i.e, between lattice points at a fixed inverse temperature () and a given magnetic field . We compare the FPGA results with an exact calculation that is known for the 1D TFIM model at a finite magnetic field. Unlike the classical Ising model where such correlations can be easily computed, the exact calculation for the corresponding quantum problem is quite involved and we omit the details here. In short, the correlation for a finite sized 1D lattice at a given and can be computed exactly by constructing a Toeplitz determinant, , where represents the correlation value at a maximum distance from a reference lattice point that is arbitrarily chosen to be in the present example (See Chapter 10, Equation (10.58) and the preceding discussion in Ref.[76]). The results that we obtain from this exact method and the FPGA sampling are shown in Fig. 9 where we observe excellent agreement between the ApC results and the exact calculation even for a relatively low number of replicas () when the value is less than or equal to which corresponds to approximately simultaneous updates. When the parameter exceeds this value, we observe systematic oscillations and large errors in the average correlation arising from a large degree of parallel updates, reminiscent of similar oscillations that are due to simultaneous parallel updating in Boltzmann networks[77].
V Discussion
These example applications help demonstrate the feasibility of an ApC governed by Eq. (5) to follow Boltzmann statistics without the need for a controlling sequencer. But in order to make use of ultrafast flip rates, f, enabled by short times, it is important that the building blocks be energy efficient to ensure that power levels are acceptable:
[TABLE]
where is the power budget and is the energy required per flip.
In Table I, we compare representative Ising computers based on sequenced designs [6, 64] to the autonomous designs implemented in and projected by this work, focusing on nearest-neighbor implementations for this discussion. The last row of the table shows for both the sequenced and autonomous designs. As shown in the table, the FPGA based 8K-spin ApC achieves an energy per flip that is similar to the Janus II sequenced FPGA implementation. However, this comparison applies some artificial constraints on the Janus II design: namely that all spins must be simultaneously resident in the device at one time. This is a requirement for unclocked autonomous designs that for the sake of comparison we applied to the sequenced design. A clocked, sequenced design can pause the system and leverage external memory for storage of neuron state, providing a large pool of logical neurons, though limited by available memory bandwidth. Additionally, the ability to pause the network enables time-multiplexing and can harness the ability to re-use logic resources.
The FPGA results naturally consume more power and have reduced density[81]. The 8K spin result of Table I has W of static power dissipation due to the periphery included in the FPGA design, not all of which is used. Using the FPGA to ASIC power ratio of 14[81], a naive migration of the design to an ASIC would result in an of . By translating these approaches to a tailored ASIC implementation, the energy efficiency per flip increases and the ability to increase the density of resident neurons within a device increases substantially. Extending this further, it should be feasible to obtain designs with petaflips per second with a power budget of W, but we need devices with fJ that also support a density of 1M devices.
The CMOS based SRAM design of Hitachi[6] has an estimated energy per flip of fJ, though the neuron density limits the ability of the approach to obtain an of petaflips per second. Using a hybrid CMOS and MTJ design as would be encountered in modern commercial MRAMs [78], it should be feasible to obtain petaflips per sec as projected in the last column of Table I [27]. Modern MRAMs can achieve Gb densities; however, we limit the projection to a 1 Mb density based on a target power consumption of W for the neurons and synapses in the design.
It should be noted that the 20 W power target was arbitrarily chosen. In principle the autonomous designs can leverage clocking to choose when global p-bit updates should occur, much like the FPGA emulator discussed in the methods section. While this approach can save power, it limits the utility of the MTJ based approach by constraining the flips per second () to the same limits of Eq. (3a) due to the clocking scheme. Even with this limit in place, as the connectivity between neurons increases beyond nearest-neighbor, the sequencing logic becomes more complex while the ApC only requires a balance of to ensure proper convergence, though both approaches still face challenges with routing congestion as grows. In the case of all-to-all connectivity, the sequencing logic reduces in complexity and technically can be implemented with a single time-multiplexed weighted p-bit. In this situation, the benefits of an ApC begin to degrade, except for the elimination of memory bottlenecks with the use of distributed weights and the avoidance of time-multiplexing. A 500 node all-to-all network was implemented using the FPGA emulator, and the resulting was directly proportional to the number of neurons, affirming the reduced benefit.
VI Conclusion and Future Work
In this work we presented a vision for an autonomous probabilistic computer for applications in optimization and machine learning. Using stochastic nanomagnets as intrinsic p-bits, an ApC based on these devices provides an opportunity to realize a scalable co-processor achieving high-performance operation. As the technology is not yet available to build a scaled device, we presented a benchmarked behavior model for the autonomous computer that facilitated direct study of the ApC. This behavioral model was implemented in an FPGA to establish a framework for the study of various applications and design trade-offs. The results presented in Table I demonstrate that with the removal of sequencers, the ability to run at speeds limited only by synapse delays, and the ability scale to millions of neurons, all within an accessible power budget, the proposed ApC is a compelling alternative to clocked, sequential designs for stochastic ANN coprocessing.
We have also emphasized a key metric, flips per second, as a figure-of-merit for emerging probabilistic annealers that quantify their performance in terms of a problem or substrate independent manner. Improving flips per second by application specific, massively parallel or autonomous architectures using nanodevices as we have suggested could lead to efficient domain-specific, probabilistic coprocessors that can outperform conventional implementations in the future.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] C. D. Schuman, T. E. Potok, R. M. Patton, J. D. Birdwell, M. E. Dean, G. S. Rose, and J. S. Plank, “A Survey of Neuromorphic Computing and Neural Networks in Hardware,” ar Xiv:1705.06963 [cs] , May 2017, ar Xiv: 1705.06963.
- 2[2] G. Detorakis, S. Dutta, A. Khanna, M. Jerry, S. Datta, and E. Neftci, “Inherent weight normalization in stochastic neural networks,” in Advances in Neural Information Processing Systems , 2019, pp. 3286–3297.
- 3[3] L. Buesing, J. Bill, B. Nessler, and W. Maass, “Neural dynamics as sampling: a model for stochastic computation in recurrent networks of spiking neurons,” P Lo S computational biology , vol. 7, no. 11, p. e 1002211, 2011.
- 4[4] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1,” ar Xiv preprint ar Xiv:1602.02830 , 2016.
- 5[5] S. Boixo, T. F. Rønnow, S. V. Isakov, Z. Wang, D. Wecker, D. A. Lidar, J. M. Martinis, and M. Troyer, “Evidence for quantum annealing with more than one hundred qubits,” Nature Physics , vol. 10, no. 3, pp. 218–224, Mar. 2014.
- 6[6] M. Yamaoka, C. Yoshimura, M. Hayashi, T. Okuyama, H. Aoki, and H. Mizuno, “A 20k-Spin Ising Chip to Solve Combinatorial Optimization Problems With CMOS Annealing,” IEEE Journal of Solid-State Circuits , vol. 51, no. 1, pp. 303–309, Jan. 2016.
- 7[7] P. L. Mc Mahon, A. Marandi, Y. Haribara, R. Hamerly, C. Langrock, S. Tamate, T. Inagaki, H. Takesue, S. Utsunomiya, K. Aihara, R. L. Byer, M. M. Fejer, H. Mabuchi, and Y. Yamamoto, “A fully-programmable 100-spin coherent Ising machine with all-to-all connections,” Science , p. aah 5178, Oct. 2016.
- 8[8] T. Inagaki, Y. Haribara, K. Igarashi, T. Sonobe, S. Tamate, T. Honjo, A. Marandi, P. L. Mc Mahon, T. Umeki, K. Enbutsu, O. Tadanaga, H. Takenouchi, K. Aihara, K.-i. Kawarabayashi, K. Inoue, S. Utsunomiya, and H. Takesue, “A coherent Ising machine for 2000-node optimization problems,” Science , vol. 354, no. 6312, pp. 603–606, Nov. 2016.
