TROOP: At-the-Roofline Performance for Vector Processors on Low Operational Intensity Workloads
Navaneeth Kunhi Purayil, Diyou Shen, Matteo Perotti, Luca Benini

TL;DR
TROOP introduces hardware optimizations for vector processors to maximize L1 memory bandwidth utilization, achieving at-the-roofline performance and significant energy efficiency improvements for memory-bound workloads.
Contribution
The paper presents TROOP, a set of micro-architectural enhancements that let vector processing elements approach peak L1 memory bandwidth utilization without compromising area or energy efficiency.
Findings
Achieves up to 2.6x speedup on key kernels
Reaches at-the-roofline performance for memory-bound workloads
Improves energy efficiency by up to 45%
Abstract
The fast evolution of Machine Learning (ML) models requires flexible and efficient hardware solutions as hardwired accelerators face rapid obsolescence. Vector processors are fully programmable and achieve high energy efficiencies by exploiting data parallelism, amortizing instruction fetch and decoding costs. Hence, a promising design choice is to build accelerators based on shared L1-memory clusters of streamlined Vector Processing Elements (VPEs). However, current state-of-the-art VPEs are limited in L1 memory bandwidth and achieve high efficiency only for computational kernels with high data reuse in the Vector Register File (VRF), such as General Matrix Multiplication (GEMM). Performance is suboptimal for workloads with lower data reuse like General Matrix-Vector Multiplication (GEMV). To fully exploit available bandwidth at the L1 memory interface, the VPE micro-architecture must…
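The contrast the abstract draws between GEMM and GEMV can be made concrete with the standard roofline model, where attainable performance is the minimum of peak compute and memory bandwidth times operational intensity. The following is a minimal sketch, not the paper's methodology: it assumes each operand is moved exactly once (compulsory traffic only), and the function names and the machine numbers (8 FLOP/cycle peak, 16 bytes/cycle bandwidth) are hypothetical values chosen for illustration.

```python
def gemv_intensity(m, n, bytes_per_elem=8):
    """FLOP per byte for y = A @ x, A of shape (m, n).

    Assumption: A, x, and y are each transferred exactly once.
    """
    flops = 2 * m * n  # one multiply and one add per matrix element
    bytes_moved = bytes_per_elem * (m * n + n + m)
    return flops / bytes_moved


def gemm_intensity(m, n, k, bytes_per_elem=8):
    """FLOP per byte for C = A @ B, A of shape (m, k), B of shape (k, n).

    Assumption: A, B, and C are each transferred exactly once.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable(peak_flops, bandwidth, intensity):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_flops, bandwidth * intensity)


if __name__ == "__main__":
    # Hypothetical machine, NOT taken from the paper.
    PEAK = 8.0        # FLOP per cycle
    BANDWIDTH = 16.0  # bytes per cycle

    for name, oi in [
        ("GEMV 256x256", gemv_intensity(256, 256)),
        ("GEMM 256x256x256", gemm_intensity(256, 256, 256)),
    ]:
        bound = attainable(PEAK, BANDWIDTH, oi)
        regime = "memory-bound" if bound < PEAK else "compute-bound"
        print(f"{name}: OI = {oi:.3f} FLOP/byte, bound = {bound:.2f} FLOP/cycle ({regime})")
```

Under these assumptions GEMV stays below 0.25 FLOP/byte for double precision regardless of matrix size, so its performance is capped by the memory roof, while GEMM's intensity grows with the matrix dimensions. This is why a kernel like GEMV only runs at the roofline if the processor can actually sustain the full memory bandwidth.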
