DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training
Maoyang Xiang, Bo Wang

TL;DR
This paper introduces DAPA, a novel distribution-aware piecewise activation function designed to enhance on-device Transformer inference and training by reducing resource consumption and increasing speed.
Contribution
DAPA is a differentiable, hardware-friendly activation that adapts to data distribution, offering significant speedups and resource savings for Transformer models.
Findings
DAPA speeds up GELU computation by 16x.
DAPA reduces DSP utilization by 16x.
DAPA maintains or improves model performance.
Abstract
Non-linear activation functions play a pivotal role in on-device inference and training, as they not only consume substantial hardware resources but also impose a significant impact on system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures by exploiting the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise linear methods. The resulting approximation is further quantized using Distribution-Weighted Mean Square Error to reduce latency and resource utilization for hardware deployment. Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16 and…
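The idea of allocating finer segments to high-probability regions can be illustrated with a small NumPy sketch. This is not the authors' algorithm: the abstract does not state how DAPA chooses its segments or how it quantizes them, so the sketch places breakpoints at equal-probability quantiles of sampled pre-activations as a stand-in. The helper names (`quantile_breakpoints`, `piecewise_gelu`, `weighted_mse`), the clamp range [-6, 6], the 16-segment budget, and the standard-normal pre-activations are all assumptions made for illustration. The distribution-weighted error is estimated as a plain mean over samples drawn from that distribution, and no quantization step is modeled.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gelu(x):
    """Exact GELU, x * Phi(x), using the Gaussian error function."""
    x = np.asarray(x, dtype=np.float64)
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def quantile_breakpoints(samples, n_segments, lo=-6.0, hi=6.0):
    """Hypothetical helper: breakpoints at equal-probability quantiles,
    so dense regions of the sample distribution get shorter segments."""
    clipped = np.clip(samples, lo, hi)
    bps = np.quantile(clipped, np.linspace(0.0, 1.0, n_segments + 1))
    bps[0], bps[-1] = lo, hi  # pin the outer edges to the clamp range
    return np.unique(bps)

def piecewise_gelu(x, bps):
    """Linear interpolation of GELU between breakpoints.
    Outside the range, use GELU's asymptotes: 0 on the left, x on the right."""
    x = np.asarray(x, dtype=np.float64)
    inner = np.interp(x, bps, gelu(bps))
    return np.where(x > bps[-1], x, np.where(x < bps[0], 0.0, inner))

def weighted_mse(samples, bps):
    """Monte Carlo estimate of E_p[(gelu - approx)^2], where p is the
    distribution the samples were drawn from."""
    return float(np.mean((piecewise_gelu(samples, bps) - gelu(samples)) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fit_x = rng.normal(size=50_000)   # assumed pre-activation distribution
    eval_x = rng.normal(size=50_000)  # held-out samples for the error estimate
    aware = quantile_breakpoints(fit_x, 16)
    uniform = np.linspace(-6.0, 6.0, 17)
    print("distribution-aware:", weighted_mse(eval_x, aware))
    print("uniform:           ", weighted_mse(eval_x, uniform))
```

Pure quantile placement leaves the tails covered by a single long segment each, so a practical scheme would likely balance probability mass against curvature. The sketch only shows why matching segment density to the data distribution can lower the weighted error for a fixed segment budget.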
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors
