# Architecture Design of a Convolutional Neural Network Accelerator for Heterogeneous Computing Based on a Fused Systolic Array

**Authors:** Yang Zong, Zhenhao Ma, Jian Ren, Yu Cao, Meng Li, Bin Liu

PMC · DOI: 10.3390/s26020628 · Sensors (Basel, Switzerland) · 2026-01-16

## TL;DR

This paper introduces a CNN accelerator for embedded systems that pairs a CPU with an ASIC and uses an operator-fused systolic array to improve energy efficiency and performance.

## Contribution

The paper proposes a CNN accelerator architecture that combines a CPU and an ASIC in a heterogeneous design, using an operator-fused 2D systolic array and a resource-optimized RISC-V core to improve energy efficiency.
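To make the systolic-array idea concrete, here is a minimal software sketch of a 2D systolic array computing a matrix product. It assumes an output-stationary dataflow (each processing element keeps its own accumulator while operands flow right and down); the summary does not state which dataflow the paper uses, and the function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary 2D systolic array, cycle by cycle.

    Assumption: PE (i, j) owns the accumulator for C[i][j]. Rows of A enter
    from the left edge and columns of B enter from the top edge, each skewed
    by one cycle per row/column so matching operands meet in the right PE.
    """
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]    # per-PE accumulators
    a_reg = [[0] * n for _ in range(m)]  # A operand held by each PE
    b_reg = [[0] * n for _ in range(m)]  # B operand held by each PE
    cycles = m + n + k - 2               # cycles until the last PE finishes

    for t in range(cycles):
        # Shift A operands one PE to the right, then inject at the left edge.
        for i in range(m):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            idx = t - i                  # row i is delayed by i cycles
            a_reg[i][0] = a[i][idx] if 0 <= idx < k else 0
        # Shift B operands one PE down, then inject at the top edge.
        for j in range(n):
            for i in range(m - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            idx = t - j                  # column j is delayed by j cycles
            b_reg[0][j] = b[idx][j] if 0 <= idx < k else 0
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(m):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)  # [[19, 22], [43, 50]] 4
```

A convolution layer can be mapped onto such an array by first lowering it to a matrix product (for example with im2col); whether the paper does so is not stated in this summary.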

## Key findings

- The proposed architecture achieves an energy efficiency of 10.46 GOPs/W, 58–350% higher than the 2.32–6.5925 GOPs/W reported for existing accelerators.
- It delivers 20.6 GFLOPs of computational performance at 1.96 W of power consumption on the Nexys Video development board.
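- Deep operator fusion merges convolution, batch normalization, and activation; an optimized RISC-V core reduces resource usage; and a locking mechanism with a prefetching strategy keeps the asynchronous platform stable.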
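The 58–350% range can be checked against the baseline range quoted in the abstract (2.32–6.5925 GOPs/W). The sketch below is only that arithmetic; the helper name `relative_gain_percent` is hypothetical, and the interpretation of the range as "proposed versus best and worst baseline" is an assumption that happens to reproduce the reported figures.

```python
def relative_gain_percent(proposed: float, baseline: float) -> float:
    """Percentage by which `proposed` exceeds `baseline`."""
    return (proposed / baseline - 1.0) * 100.0


PROPOSED_GOPS_PER_W = 10.46   # reported for the proposed accelerator
BASELINE_LOW = 2.32           # weakest baseline quoted in the abstract
BASELINE_HIGH = 6.5925        # strongest baseline quoted in the abstract

# Gain over the strongest baseline gives the lower bound of the range,
# gain over the weakest baseline gives the upper bound.
gain_vs_best = relative_gain_percent(PROPOSED_GOPS_PER_W, BASELINE_HIGH)
gain_vs_worst = relative_gain_percent(PROPOSED_GOPS_PER_W, BASELINE_LOW)

print(f"{gain_vs_best:.1f}% to {gain_vs_worst:.1f}%")
```

Both bounds land just above 58% and 350% respectively, which matches the range stated in the key findings.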

## Abstract

Convolutional Neural Networks (CNNs) generally suffer from excessive computational overhead, high resource consumption, and complex network structures, which severely restrict their deployment on microprocessor chips. Existing related accelerators achieve an energy efficiency ratio of only 2.32–6.5925 GOPs/W, making it difficult to meet the low-power requirements of embedded application scenarios. To address these issues, this paper proposes a low-power, high-energy-efficiency CNN accelerator architecture based on heterogeneous computing with a central processing unit (CPU) and an Application-Specific Integrated Circuit (ASIC), adopting an operator-fused systolic array algorithm with the YOLOv5n target detection network as the application benchmark. It integrates a 2D systolic array with Conv-BN fusion technology to achieve deep operator fusion of convolution, batch normalization, and activation functions; optimizes the RISC-V core to reduce resource usage; and adopts a locking mechanism and a prefetching strategy for the asynchronous platform to ensure operational stability. Experiments on the Nexys Video development board show that the architecture achieves 20.6 GFLOPs of computational performance, 1.96 W of power consumption, and an energy efficiency ratio of 10.46 GOPs/W, which is 58–350% higher than that of existing mainstream accelerators, thus demonstrating excellent potential for embedded deployment.
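Conv-BN fusion, as commonly practiced at inference time, folds the batch-normalization statistics into the convolution weights and bias so that only one multiply-accumulate pass is needed. The sketch below illustrates that standard folding on a single-channel 1D convolution; it is not the paper's hardware implementation. All function names and numeric values are hypothetical, and SiLU is assumed as the activation only for illustration.

```python
import math


def conv1d(signal, weights, bias):
    """Valid-mode 1D correlation with a single output channel."""
    k = len(weights)
    return [
        sum(weights[j] * signal[i + j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]


def batch_norm(values, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed running statistics."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * (v - mean) + beta for v in values]


def fuse_conv_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the convolution: w' = w * s, b' = (b - mean) * s + beta."""
    scale = gamma / math.sqrt(var + eps)
    fused_weights = [w * scale for w in weights]
    fused_bias = (bias - mean) * scale + beta
    return fused_weights, fused_bias


def silu(v):
    """SiLU activation, assumed here for illustration."""
    return v / (1.0 + math.exp(-v))


# Hypothetical layer parameters and input, chosen only for the demonstration.
signal = [0.5, -1.0, 2.0, 0.25, 1.5]
weights, bias = [0.2, -0.4, 0.1], 0.05
gamma, beta, mean, var = 1.5, -0.2, 0.3, 0.8

# Reference path: three separate operators.
unfused = [silu(v) for v in
           batch_norm(conv1d(signal, weights, bias), gamma, beta, mean, var)]

# Fused path: one convolution with folded parameters, then the activation.
fused_w, fused_b = fuse_conv_bn(weights, bias, gamma, beta, mean, var)
fused = [silu(v) for v in conv1d(signal, fused_w, fused_b)]

max_abs_diff = max(abs(x - y) for x, y in zip(unfused, fused))
print(max_abs_diff < 1e-9)  # True
```

Because the folding is done once, offline, the fused layer removes the separate normalization pass and its intermediate buffer, which is the usual motivation for applying it on resource-constrained hardware.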

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12845848/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12845848/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/PMC12845848/full.md

---
Source: https://tomesphere.com/paper/PMC12845848