Resource-aware Mixed-precision Quantization for Enhancing Deployability of Transformers for Time-series Forecasting on Embedded FPGAs
Tianheng Ling, Chao Qian, Gregor Schiele

TL;DR
This paper presents a resource-aware mixed-precision quantization method that improves the deployability of Transformer models on resource-constrained embedded FPGAs, enabling more efficient edge AI applications.
Contribution
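It extends the authors' VHDL template with a selectable resource type for storing intermediate results between model layers, and introduces a resource-aware mixed-precision quantization approach whose resource estimates closely match actual deployment, overcoming deployment limitations of uniform quantization configurations.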
Findings
Resource utilization estimates differ from actual deployment metrics by as little as 3%.
Mixed-precision quantization makes previously non-deployable model configurations deployable.
Using BRAM for intermediate results removes a deployment bottleneck, broadening the use of Transformers on embedded FPGAs.
Abstract
This study addresses the deployment challenges of integer-only quantized Transformers on resource-constrained embedded FPGAs (Xilinx Spartan-7 XC7S15). We enhanced the flexibility of our VHDL template by introducing a selectable resource type for storing intermediate results across model layers, thereby breaking the deployment bottleneck by utilizing BRAM efficiently. Moreover, we developed a resource-aware mixed-precision quantization approach that enables researchers to explore hardware-level quantization strategies without requiring extensive expertise in Neural Architecture Search. This method provides accurate resource utilization estimates, with a discrepancy as low as 3% compared to actual deployment metrics. Compared to previous work, our approach has successfully facilitated the deployment of model configurations utilizing mixed-precision quantization, thus overcoming…
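The idea of a resource-aware search over per-layer bit-widths can be illustrated with a minimal Python sketch. This is not the authors' estimator or search procedure: the cost model (elements times bit-width), the layer sizes, the candidate bit-widths, the memory budget, and the function names `estimate_memory_bits` and `feasible_configs` are all hypothetical, chosen only to show how a uniform configuration can exceed a budget while a mixed-precision one fits.

```python
from itertools import product


def estimate_memory_bits(layer_sizes, bit_widths):
    """Toy cost model (assumption): each layer stores `size` values at `bits` bits."""
    return sum(size * bits for size, bits in zip(layer_sizes, bit_widths))


def feasible_configs(layer_sizes, choices, budget_bits):
    """Enumerate per-layer bit-width assignments whose estimated cost fits the budget.

    Results are ordered by total bit-width (a crude, assumed proxy for model
    precision, highest first), then by estimated cost (lowest first).
    """
    configs = []
    for widths in product(choices, repeat=len(layer_sizes)):
        cost = estimate_memory_bits(layer_sizes, widths)
        if cost <= budget_bits:
            configs.append((widths, cost))
    configs.sort(key=lambda item: (-sum(item[0]), item[1]))
    return configs


# Hypothetical numbers, not taken from the paper or from any device datasheet.
layer_sizes = [1024, 4096, 1024]   # values stored per layer
choices = (4, 6, 8)                # candidate bit-widths
budget_bits = 40_000               # assumed on-chip memory budget

uniform_8bit_cost = estimate_memory_bits(layer_sizes, (8, 8, 8))
configs = feasible_configs(layer_sizes, choices, budget_bits)
best_widths, best_cost = configs[0]

print("uniform 8-bit cost:", uniform_8bit_cost, "fits:", uniform_8bit_cost <= budget_bits)
print("best feasible mixed config:", best_widths, "cost:", best_cost)
```

Under these assumed numbers, the uniform 8-bit configuration exceeds the budget, while a configuration that keeps 8 bits in the two small layers and drops the large layer to 4 bits fits. A real flow would replace the toy cost model with estimates of the FPGA resources actually consumed and would validate the chosen configuration against model accuracy.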
