Resource-aware Mixed-precision Quantization for Enhancing Deployability of Transformers for Time-series Forecasting on Embedded FPGAs

Tianheng Ling; Chao Qian; Gregor Schiele

arXiv:2410.03294 · cs.LG · April 22, 2026

TL;DR

This paper presents a resource-aware mixed-precision quantization method that improves the deployability of Transformer models on resource-constrained embedded FPGAs, enabling more efficient edge AI applications.

Contribution

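The paper extends a VHDL template with a selectable resource type for storing intermediate results across model layers, and adds a resource-aware mixed-precision quantization approach that estimates resource usage before deployment, overcoming the deployment limits of uniform quantization configurations.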

Findings

01

Resource utilization estimates deviated from actual deployment metrics by as little as 3%.

02

Mixed-precision quantization enabled the deployment of model configurations that were previously non-deployable.

03

Facilitated broader application of Transformers on embedded FPGA devices.

Abstract

This study addresses the deployment challenges of integer-only quantized Transformers on resource-constrained embedded FPGAs (Xilinx Spartan-7 XC7S15). We enhanced the flexibility of our VHDL template by introducing a selectable resource type for storing intermediate results across model layers, thereby breaking the deployment bottleneck by utilizing BRAM efficiently. Moreover, we developed a resource-aware mixed-precision quantization approach that enables researchers to explore hardware-level quantization strategies without requiring extensive expertise in Neural Architecture Search. This method provides accurate resource utilization estimates with a precision discrepancy as low as 3%, compared to actual deployment metrics. Compared to previous work, our approach has successfully facilitated the deployment of model configurations utilizing mixed-precision quantization, thus overcoming…
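The idea of checking a mixed-precision configuration against a BRAM budget before deployment can be illustrated with a minimal sketch. This is not the authors' estimator: the layer sizes, candidate bit-widths, and block budget below are hypothetical, and the estimate only counts raw storage bits per layer against the 36 Kb capacity of a Xilinx 7-series block RAM, ignoring port-width packing and all LUT and DSP usage.

```python
from itertools import product
from math import ceil

# Capacity of one 36 Kb block RAM primitive on Xilinx 7-series devices.
BRAM_BITS = 36 * 1024


def bram_blocks(num_values: int, bit_width: int) -> int:
    """Blocks needed to hold num_values at bit_width (raw bits only)."""
    return ceil(num_values * bit_width / BRAM_BITS)


def estimate_blocks(layer_sizes, bit_widths) -> int:
    """Sum the per-layer block estimates; each layer gets its own blocks."""
    return sum(bram_blocks(n, b) for n, b in zip(layer_sizes, bit_widths))


def feasible_configs(layer_sizes, choices, budget):
    """Yield every per-layer bit-width assignment whose estimate fits the budget."""
    for widths in product(choices, repeat=len(layer_sizes)):
        used = estimate_blocks(layer_sizes, widths)
        if used <= budget:
            yield widths, used


# Hypothetical values: number of stored values per layer and a block budget.
layer_sizes = [4096, 16384, 4096]
budget = 5

uniform_8bit = estimate_blocks(layer_sizes, (8, 8, 8))
mixed = estimate_blocks(layer_sizes, (8, 4, 8))
configs = list(feasible_configs(layer_sizes, choices=(4, 6, 8), budget=budget))

print(f"uniform 8-bit needs {uniform_8bit} blocks, budget is {budget}")
print(f"mixed (8, 4, 8) needs {mixed} blocks")
print(f"{len(configs)} of {3 ** len(layer_sizes)} configurations fit")
```

With these made-up numbers, the uniform 8-bit configuration exceeds the budget, while lowering the precision of only the largest layer brings the estimate back under it. That is the kind of trade-off a resource-aware search is meant to expose; a real estimator would also need to model the other FPGA resources and be validated against synthesis reports.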
