TROOP: At-the-Roofline Performance for Vector Processors on Low Operational Intensity Workloads

Navaneeth Kunhi Purayil; Diyou Shen; Matteo Perotti; Luca Benini

arXiv:2508.03900·cs.AR·January 7, 2026

TL;DR

TROOP introduces hardware optimizations for vector processors to maximize L1 memory bandwidth utilization, achieving at-the-roofline performance and significant energy efficiency improvements for memory-bound workloads.

Contribution

The paper presents TROOP, a novel set of micro-architectural enhancements that enable vector processing elements to reach peak memory bandwidth utilization without increasing area or energy costs.

Findings

01 Achieves up to 2.6x speedup on key kernels

02 Reaches at-the-roofline performance for memory-bound workloads

03 Improves energy efficiency by up to 45%

Abstract

The fast evolution of Machine Learning (ML) models requires flexible and efficient hardware solutions as hardwired accelerators face rapid obsolescence. Vector processors are fully programmable and achieve high energy efficiencies by exploiting data parallelism, amortizing instruction fetch and decoding costs. Hence, a promising design choice is to build accelerators based on shared L1-memory clusters of streamlined Vector Processing Elements (VPEs). However, current state-of-the-art VPEs are limited in L1 memory bandwidth and achieve high efficiency only for computational kernels with high data reuse in the Vector Register File (VRF), such as General Matrix Multiplication (GEMM). Performance is suboptimal for workloads with lower data reuse like General Matrix-Vector Multiplication (GEMV). To fully exploit available bandwidth at the L1 memory interface, the VPE micro-architecture must…
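The contrast the abstract draws between GEMM and GEMV can be illustrated with a basic roofline estimate. The sketch below is only a back-of-the-envelope model, not the paper's methodology: it assumes every operand crosses the memory interface exactly once, and the peak compute rate, bandwidth, matrix sizes, and function names are hypothetical values chosen for illustration, not figures from TROOP.

```python
def gemv_intensity(m, n, bytes_per_elem):
    """FLOPs per byte for y = A @ x with A of shape (m, n).

    Assumes A, x, and y each move across the memory interface once.
    """
    flops = 2 * m * n  # one multiply and one add per matrix element
    bytes_moved = (m * n + n + m) * bytes_per_elem
    return flops / bytes_moved


def gemm_intensity(m, n, k, bytes_per_elem):
    """FLOPs per byte for C = A @ B with A (m, k) and B (k, n).

    Assumes ideal reuse: A, B, and C each move once.
    """
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved


def attainable(peak_flops_per_cycle, bw_bytes_per_cycle, intensity):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops_per_cycle, bw_bytes_per_cycle * intensity)


if __name__ == "__main__":
    # Hypothetical machine parameters, NOT taken from the paper.
    PEAK = 32.0  # FLOPs per cycle
    BW = 16.0    # bytes per cycle at the memory interface

    kernels = [
        ("GEMV 256x256 fp32", gemv_intensity(256, 256, 4)),
        ("GEMM 256x256x256 fp32", gemm_intensity(256, 256, 256, 4)),
    ]
    for name, oi in kernels:
        bound = attainable(PEAK, BW, oi)
        regime = "memory-bound" if bound < PEAK else "compute-bound"
        print(f"{name}: OI={oi:.2f} FLOP/B, bound={bound:.2f} FLOP/cycle ({regime})")
```

Under these assumptions GEMV stays below 0.5 FLOP per byte for 4-byte elements regardless of matrix size, so its attainable performance scales directly with the bandwidth the processing element can sustain, whereas GEMM with good reuse reaches the compute roof.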
