A Real Time 1280x720 Object Detection Chip With 585MB/s Memory Traffic

Kuo-Wei Chang; Hsu-Tung Shih; Tian-Sheuan Chang; Shang-Hong Tsai; Chih-Chyau Yang; Chien-Ming Wu; Chun-Ming Huang

arXiv:2205.01571·cs.AR·May 4, 2022

TL;DR

This paper presents a low-memory-traffic deep learning accelerator chip optimized for real-time 720p object detection, significantly reducing memory bandwidth and energy consumption through hardware-software co-design and model fusion techniques.

Contribution

It introduces a novel hardware-software co-optimized DLA chip that employs model fusion to drastically cut memory traffic and energy use for high-definition object detection.

Findings

01 Reduces YOLOv2 feature memory traffic from 2.9 GB/s to 0.15 GB/s

02 Supports real-time 1280x720@30FPS object detection

03 Consumes 7.9X less external DRAM access energy than the authors' previous DLA with the same number of PEs
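To put the first finding in per-frame terms, the short calculation below derives the reduction factor and the feature traffic per frame. It is only arithmetic on the quoted figures, under two assumptions that the page does not state: that both traffic numbers refer to the 30 FPS operating point, and that 1 GB means 1000 MB. The function names are illustrative.

```python
# Figures quoted in the findings above.
BASELINE_GB_PER_S = 2.9   # YOLOv2 feature traffic before fusion
FUSED_GB_PER_S = 0.15     # YOLOv2 feature traffic after fusion
FPS = 30                  # assumed operating point: 1280x720 @ 30 FPS


def reduction_factor(before: float, after: float) -> float:
    """How many times smaller the traffic became."""
    return before / after


def mb_per_frame(gb_per_s: float, fps: int) -> float:
    """Traffic per frame in MB, assuming decimal units (1 GB = 1000 MB)."""
    return gb_per_s * 1000.0 / fps


if __name__ == "__main__":
    factor = reduction_factor(BASELINE_GB_PER_S, FUSED_GB_PER_S)
    print(f"reduction factor: {factor:.1f}x")
    print(f"before fusion: {mb_per_frame(BASELINE_GB_PER_S, FPS):.1f} MB/frame")
    print(f"after fusion:  {mb_per_frame(FUSED_GB_PER_S, FPS):.1f} MB/frame")
```

Under these assumptions the feature traffic drops by roughly 19X, from about 97 MB to about 5 MB per frame.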

Abstract

Memory bandwidth has become the real-time bottleneck of current deep learning accelerators (DLA), particularly for high definition (HD) object detection. Under resource constraints, this paper proposes a low memory traffic DLA chip with joint hardware and software optimization. To maximize hardware utilization under memory bandwidth constraints, we morph and fuse the object detection model into a group fusion-ready model to reduce intermediate data access. This reduces YOLOv2's feature memory traffic from 2.9 GB/s to 0.15 GB/s. To support group fusion, our previous DLA-based hardware employs a unified buffer with write-masking for simple layer-by-layer processing in a fusion group. When compared to our previous DLA with the same number of PEs, the chip implemented in a TSMC 40nm process supports 1280x720@30FPS object detection and consumes 7.9X less external DRAM access energy, from 2607 mJ to…
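The idea that fusing layers into groups reduces intermediate data access can be illustrated with a toy traffic model. The sketch below is not the paper's algorithm or its YOLOv2 configuration: the layer shapes are made up, weight traffic is ignored, and it assumes every intermediate feature map inside a fusion group fits in the on-chip buffer. The names `feature_bytes`, `layer_by_layer_traffic`, and `fused_traffic` are hypothetical.

```python
from typing import List, Tuple


def feature_bytes(h: int, w: int, c: int, bytes_per_elem: int = 1) -> int:
    """Size of one feature map in bytes (1 byte per element is an assumption)."""
    return h * w * c * bytes_per_elem


def layer_by_layer_traffic(sizes: List[int]) -> int:
    """DRAM feature traffic when every layer reads its input from DRAM
    and writes its output back. sizes[i] is the input of layer i and
    sizes[i + 1] is its output."""
    return sum(sizes[i] + sizes[i + 1] for i in range(len(sizes) - 1))


def fused_traffic(sizes: List[int], groups: List[Tuple[int, int]]) -> int:
    """DRAM feature traffic when layers are fused into groups.
    Each group (start, end) covers layers start..end-1; only the group's
    input is read from DRAM and only its final output is written back."""
    return sum(sizes[start] + sizes[end] for start, end in groups)


if __name__ == "__main__":
    # Hypothetical three-layer network (shapes are illustrative only).
    sizes = [
        feature_bytes(8, 8, 3),    # network input: 192 bytes
        feature_bytes(8, 8, 16),   # layer 0 output: 1024 bytes
        feature_bytes(4, 4, 32),   # layer 1 output: 512 bytes
        feature_bytes(4, 4, 64),   # layer 2 output: 1024 bytes
    ]
    print("layer by layer:", layer_by_layer_traffic(sizes))
    print("two groups:    ", fused_traffic(sizes, [(0, 2), (2, 3)]))
    print("one group:     ", fused_traffic(sizes, [(0, 3)]))
```

In this model the saving comes entirely from intermediate feature maps that never leave the chip, so larger groups help only as long as the on-chip buffer can hold what they produce.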

Taxonomy

Methods: Deep Layer Aggregation