# Spike-based Q-learning in a non-von Neumann architecture

**Authors:** Donghyuk Shin, Hyeongcheol Jo, Hyeseung Jang, Yoo Ho Jeong, YeonJoo Jeong, Joon Young Kwak, Jongkil Park, Suyoun Lee, Inho Kim, Jong-Keuk Park, Seongsik Park, Hyun Jae Jang, Hyung-Min Lee, Jaewook Kim

PMC · DOI: 10.3389/fnins.2026.1738140 · Frontiers in Neuroscience · 2026-02-03

## TL;DR

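This paper proposes a hardware-feasible non-von Neumann architecture that performs Q-learning with a spiking neural network, storing each Q-value locally in a synapse to reduce the data-transfer bottlenecks and power consumption associated with von Neumann systems. Simulations on the Cart-pole benchmark validate the design.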

## Contribution

A hardware-feasible non-von Neumann architecture for spike-based Q-learning with local Q-value storage and efficient updates.

## Key findings

- The architecture uses one-hot encoding to map states and actions to neurons and stores Q-values locally in synapses.
- Simulations on the Cart-pole benchmark showed stable learning performance with low-bit precision.
- The design achieves comparable accuracy to software-based Q-learning when sufficient bit precision is used.
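The dependence on bit precision noted above can be illustrated with a minimal sketch. It assumes a uniform fixed-point grid on the range [0, 1] and round-to-nearest storage; the summary does not state how the paper quantizes Q-values, so `quantize`, `quantized_update`, and every parameter value here are hypothetical. The sketch shows one general way low precision can affect Q-learning: an update smaller than half a quantization step is rounded away.

```python
import numpy as np


def quantize(q, n_bits, q_min=0.0, q_max=1.0):
    """Round q to the nearest of 2**n_bits evenly spaced levels in [q_min, q_max].

    Assumption: uniform fixed-point storage; not taken from the paper.
    """
    step = (q_max - q_min) / (2 ** n_bits - 1)
    clipped = np.clip(q, q_min, q_max)
    return q_min + np.round((clipped - q_min) / step) * step


def quantized_update(q_sa, target, alpha, n_bits):
    """Apply one Q-learning step, then store the result at n_bits precision."""
    return quantize(q_sa + alpha * (target - q_sa), n_bits)


# Same update, two storage precisions (illustrative values only).
coarse = quantized_update(0.0, 1.0, alpha=0.05, n_bits=3)  # step = 1/7
fine = quantized_update(0.0, 1.0, alpha=0.05, n_bits=8)    # step = 1/255
```

In this toy setting the 3-bit value stays at zero because the increment of 0.05 is below half of the 1/7 step, while the 8-bit value moves toward the target.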

## Abstract

Non-von Neumann architectures overcome the memory-compute separation of von Neumann systems by distributing computation and memory locally, thereby reducing data-transfer bottlenecks and power consumption. These features are particularly advantageous for reinforcement learning (RL) workloads that rely on frequent value-function updates across large state-action spaces. When combined with event-driven spiking neural networks (SNNs), non-von Neumann architectures can further improve overall computational efficiency by leveraging the sparse nature of spike-based processing. In this study, we propose a hardware-feasible SNN-based non-von Neumann architecture that performs Q-learning, one of the most widely known reinforcement learning algorithms. The proposed architecture maps states and actions to individual neurons using one-hot encoding and locally stores each state–action pair's Q-value in the corresponding synapse. To enable each synapse to update its local Q-value based on the next state maximum Q stored in other synapses, a neuron group connected through a lateral inhibition structure is employed to produce the maximum Q, which is then globally transmitted to all synapses. A delay circuit is also added to align the next-state and current-state values to ensure temporally consistent updates. Each synapse locally generates a learning selection signal and combines it with the globally transmitted signals to update only the target synapse. The proposed architecture was validated through simulations on the Cart-pole benchmark, showing stable learning performance under low-bit precision and achieving comparable accuracy to software-based Q-learning with sufficient bit precision.
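The update scheme described in the abstract can be sketched in software as follows. This is a rough functional analogue under stated assumptions, not the authors' circuit: the class `OneHotQTable`, its method names, and the values of `alpha` and `gamma` are hypothetical. The Q matrix stands in for the synapses, the row-wise maximum stands in for the lateral-inhibition neuron group, and passing the previous state and action together with the next state stands in for the delay circuit. The rule applied is the standard tabular Q-learning update, Q(s, a) += alpha * (r + gamma * max Q(s', a') - Q(s, a)).

```python
import numpy as np


class OneHotQTable:
    """Software analogue of one-hot state/action neurons with synaptic Q-values."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        # One entry per state-action pair, playing the role of one synapse.
        self.q = np.zeros((n_states, n_actions))

    @staticmethod
    def one_hot(index, size):
        v = np.zeros(size)
        v[index] = 1.0
        return v

    def max_q(self, state):
        # A plain max replaces the winner-take-all neuron group here.
        row = self.one_hot(state, self.n_states) @ self.q
        return float(row.max())

    def update(self, prev_state, prev_action, reward, next_state, done=False):
        # Local selection signal: 1 only at the (prev_state, prev_action) synapse.
        select = np.outer(
            self.one_hot(prev_state, self.n_states),
            self.one_hot(prev_action, self.n_actions),
        )
        # Global signal: the same target is broadcast to every synapse.
        bootstrap = 0.0 if done else self.gamma * self.max_q(next_state)
        target = reward + bootstrap
        # Only the selected synapse changes; all others are multiplied by 0.
        self.q += select * self.alpha * (target - self.q)
        return self.q[prev_state, prev_action]


table = OneHotQTable(n_states=3, n_actions=2, alpha=0.5, gamma=0.9)
table.q[2, 1] = 2.0  # pretend the next state already has a learned value
new_value = table.update(prev_state=0, prev_action=1, reward=1.0, next_state=2)
```

Masking a broadcast target with a per-entry selection signal mirrors the abstract's split between globally transmitted signals and locally generated selection, at the cost of touching every entry in software where the hardware would update only one synapse.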

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12960630/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12960630/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/PMC12960630/full.md

---
Source: https://tomesphere.com/paper/PMC12960630