A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE
Ikumi Okubo, Keisuke Sugiura, Hiroki Matsutani

TL;DR
This paper presents a lightweight, FPGA-implemented Tiny Transformer model using Neural ODEs and quantization, achieving high accuracy and significant speedup for edge image recognition tasks.
Contribution
It introduces a hybrid model that uses a Neural ODE backbone in place of ResNet, combined with MHSA and optimized for FPGA deployment, reducing resource use while maintaining accuracy.
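The parameter-reuse idea behind a Neural ODE backbone can be illustrated with a minimal sketch. It assumes a forward-Euler solver and a toy dense-plus-tanh block; the function names, dimensions, and weights below are hypothetical and are not taken from the paper. A ResNet-style stack holds a separate parameter set for each of its steps, while the ODE-style loop applies one shared set repeatedly.

```python
import math

def block(x, weights, bias):
    """Toy building block f(x): a dense layer followed by tanh (illustrative only)."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]

def resnet_forward(x, layers):
    """ResNet-style stack: each step uses its own (weights, bias) pair."""
    for weights, bias in layers:
        fx = block(x, weights, bias)
        x = [xi + fi for xi, fi in zip(x, fx)]
    return x

def ode_forward(x, weights, bias, steps, t_end=1.0):
    """Forward-Euler integration: the same (weights, bias) is reused at every step."""
    h = t_end / steps  # step size
    for _ in range(steps):
        fx = block(x, weights, bias)
        x = [xi + h * fi for xi, fi in zip(x, fx)]
    return x

def count_params(layers):
    """Total number of weights and biases held by a list of blocks."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

dim, steps = 4, 8  # hypothetical sizes
weights = [[0.1 * (i - j) for j in range(dim)] for i in range(dim)]
bias = [0.0] * dim

# A ResNet with 8 blocks stores 8 parameter sets; the ODE block stores one.
resnet_params = count_params([(weights, bias)] * steps)
ode_params = count_params([(weights, bias)])

y = ode_forward([1.0, 0.0, -1.0, 0.5], weights, bias, steps)
```

In this sketch, raising `steps` adds compute but no parameters, which is the property that makes the approach attractive when on-chip memory is limited.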
Findings
The quantized model achieves 79.68% top-1 accuracy on the STL10 dataset.
The FPGA implementation accelerates the backbone and MHSA components by 34.01×.
It provides a 7.10× energy efficiency improvement over an ARM Cortex-A53.
Abstract
Transformers have been adopted for image recognition tasks and shown to outperform CNNs and RNNs, but they suffer from high training cost and computational complexity. To address these issues, a hybrid approach has become a recent research trend, which replaces a part of ResNet with an MHSA (Multi-Head Self-Attention). In this paper, we propose a lightweight hybrid model which uses Neural ODE (Ordinary Differential Equation) as a backbone instead of ResNet so that we can increase the number of iterations of building blocks while reusing the same parameters, mitigating the increase in parameter size per iteration. The proposed model is deployed on a modest-sized FPGA device for edge computing. The model is further quantized with a QAT (Quantization-Aware Training) scheme to reduce FPGA resource utilization while suppressing the accuracy loss. The quantized model achieves 79.68% top-1 accuracy…
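As a rough illustration of what quantization-aware training simulates, the sketch below applies "fake quantization": values are rounded onto a signed integer grid and mapped back to floats, so the forward pass sees the rounding error that the deployed hardware would introduce. The symmetric per-tensor scheme, the 8-bit width, and the function name `fake_quantize` are assumptions made for this example; the abstract does not state the scheme or bit width the authors use.

```python
def fake_quantize(values, num_bits=8):
    """Round values onto a symmetric signed integer grid, then map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one integer step
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized], scale

weights = [0.50, -0.25, 0.126, 0.0]                # hypothetical weights
dequantized, scale = fake_quantize(weights, num_bits=8)

# The rounding error per value is bounded by half a quantization step.
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
```

During QAT, the forward pass typically uses values like `dequantized` while gradients update the underlying float weights, which lets the model adapt to the reduced precision before deployment.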
Taxonomy
Topics: Advanced Neural Network Applications · Neural Networks and Applications · CCD and CMOS Imaging Sensors
Methods: Attention Is All You Need · Average Pooling · Linear Layer · 1x1 Convolution · Kaiming Initialization · Residual Connection · Absolute Position Encodings · Dense Connections · Position-Wise Feed-Forward Layer
