Toward Attention-based TinyML: A Heterogeneous Accelerated Architecture and Automated Deployment Flow

Philip Wiese; Gamze İslamoğlu; Moritz Scherer; Luka Macan; Victor J.B. Jung; Alessio Burrello; Francesco Conti; Luca Benini

arXiv:2408.02473·cs.AR·January 10, 2025

TL;DR

This paper presents a heterogeneous architecture that couples RISC-V processors with a hardwired accelerator for quantized Attention, supported by an automated deployment flow. It runs end-to-end 8-bit Transformer inference within a tinyML power envelope at 2960 GOp/J and 154 GOp/s.

Contribution

It introduces a heterogeneous architectural template that couples an octa-core RISC-V cluster with an accelerator for quantized Attention, together with an automated deployment flow that enables end-to-end 8-bit Transformer inference.

Findings

1. Achieved 2960 GOp/J energy efficiency
2. Reached 154 GOp/s throughput
3. Demonstrated end-to-end 8-bit Transformer inference

Abstract

One of the challenges for Tiny Machine Learning (tinyML) is keeping up with the evolution of Machine Learning models from Convolutional Neural Networks to Transformers. We address this by leveraging a heterogeneous architectural template coupling RISC-V processors with hardwired accelerators supported by an automated deployment flow. We demonstrate Attention-based models in a tinyML power envelope with an octa-core cluster coupled with an accelerator for quantized Attention. Our deployment flow enables end-to-end 8-bit Transformer inference, achieving leading-edge energy efficiency and throughput of 2960 GOp/J and 154 GOp/s (0.65 V, 22 nm FD-SOI technology).
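The two headline figures can be related with simple unit arithmetic. The sketch below assumes that the 2960 GOp/J and 154 GOp/s values refer to the same operating point (the abstract quotes both at 0.65 V); under that assumption, dividing throughput by energy efficiency gives the implied power draw. The function names are illustrative and do not come from the paper or its repository.

```python
# Back-of-the-envelope check relating the abstract's two headline numbers.
# Assumption: both figures describe the same operating point (0.65 V).

def implied_power_mw(throughput_gops: float, efficiency_gop_per_j: float) -> float:
    """Return the power in mW implied by throughput and energy efficiency.

    (GOp/s) / (GOp/J) = J/s = W, then scaled to milliwatts.
    """
    watts = throughput_gops / efficiency_gop_per_j
    return watts * 1e3


def energy_per_op_pj(efficiency_gop_per_j: float) -> float:
    """Return the energy per operation in picojoules.

    1 J per (efficiency * 1e9 Op), times 1e12 pJ/J, simplifies to 1e3 / efficiency.
    """
    return 1e3 / efficiency_gop_per_j


power_mw = implied_power_mw(154.0, 2960.0)
energy_pj = energy_per_op_pj(2960.0)

print(f"implied power: {power_mw:.2f} mW")
print(f"energy per operation: {energy_pj:.3f} pJ")
```

If the two figures were instead measured at different operating points, the quotient would not correspond to any real power measurement, so the result should be read only as a consistency check.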

Code & Models

Repositories

pulp-platform/deeploy (official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Graph Theory and Algorithms · Distributed and Parallel Computing Systems

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Dropout · Adam · Position-Wise Feed-Forward Layer · Label Smoothing · Transformer · Softmax · Linear Layer