QuAKE: Speeding up Model Inference Using Quick and Approximate Kernels for Exponential Non-Linearities
Sai Kiran Narayanaswami, Gopalakrishnan Srinivasan, Balaraman Ravindran

TL;DR
QuAKE introduces novel approximate exponential operators that significantly accelerate inference in models like Transformers by leveraging floating point properties, achieving up to 45% speed gains with minimal performance loss.
Contribution
This work presents QuAKE, a set of operators that efficiently approximate the exponential function without requiring specialized hardware, improving inference speed with minimal loss in model accuracy.
Findings
Inference speed improved by up to 45% on CPUs.
Minimal impact on model performance across various tasks.
Applicable to multiple exponential non-linearities like Softmax and GELU.
Abstract
As machine learning is deployed more and more widely, and model sizes continue to grow, improving computational efficiency during model inference has become a key challenge. In many commonly used model architectures, including Transformers, a significant portion of the inference computation consists of exponential non-linearities such as Softmax. In this work, we develop QuAKE, a collection of novel operators that leverage certain properties of IEEE-754 floating point representations to quickly approximate the exponential function without requiring specialized hardware, extra memory, or precomputation. We propose optimizations that enhance the efficiency of QuAKE in commonly used exponential non-linearities such as Softmax, GELU, and the Logistic function. Our benchmarks demonstrate substantial inference speed improvements between 10% and 35% on server CPUs, and 5% and 45% on…
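The abstract does not spell out how the QuAKE operators are built, so the following is only a minimal sketch of the general idea of exploiting the IEEE-754 layout to approximate the exponential. It uses the classic Schraudolph-style bit trick rather than the paper's own operators: because a float32 stores its exponent in the high bits, writing a scaled and shifted copy of x into the integer bit pattern yields a value close to e^x. The names `fast_exp` and `approx_softmax` and the constant `_SHIFT` are assumptions made for illustration and do not come from the paper.

```python
import math
import struct

# Illustrative constants for a Schraudolph-style float32 approximation.
# These are assumptions for this sketch, not values taken from QuAKE.
_SCALE = (1 << 23) / math.log(2.0)  # maps x to units of the float32 exponent field
_BIAS = 127 << 23                   # float32 exponent bias, shifted into place
_SHIFT = 486411                     # empirical constant that balances the error
_MAX_BITS = 0x7F7FFFFF              # largest finite positive float32 bit pattern


def fast_exp(x: float) -> float:
    """Approximate e**x (finite x) by building float32 bits directly."""
    bits = int(_SCALE * x) + _BIAS - _SHIFT
    # Clamp so the pattern stays a non-negative finite float32.
    bits = max(0, min(bits, _MAX_BITS))
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def approx_softmax(scores):
    """Softmax over a list of finite floats using the approximate exponential."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [fast_exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(x, fast_exp(x), math.exp(x))
    print(approx_softmax([1.0, 2.0, 3.0]))
```

In this sketch the relative error of `fast_exp` stays within a few percent for inputs in roughly [-87, 88], and part of that error cancels in the Softmax normalization. A production kernel would apply the same integer arithmetic to whole vectors instead of one Python float at a time.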
Taxonomy
Topics: Gaussian Processes and Bayesian Inference
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax
