QuAKE: Speeding up Model Inference Using Quick and Approximate Kernels for Exponential Non-Linearities

Sai Kiran Narayanaswami; Gopalakrishnan Srinivasan; Balaraman Ravindran

arXiv:2412.00408 · cs.LG · December 3, 2024

TL;DR

QuAKE introduces novel approximate exponential operators that accelerate inference in models such as Transformers by exploiting properties of IEEE-754 floating point representations, achieving speed gains of up to 45% with minimal performance loss.

Contribution

This work presents QuAKE, a set of operators that approximate the exponential function efficiently without specialized hardware, extra memory, or precomputation, improving inference speed with minimal impact on model accuracy.

Findings

01. Inference speed improved by up to 45% on CPUs.

02. Minimal impact on model performance across various tasks.

03. Applicable to multiple exponential non-linearities like Softmax and GELU.
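To make the last finding concrete, here is a minimal sketch of how Softmax, the Logistic function, and GELU can each be written around a single exponential call. It assumes the widely used tanh form of GELU, which may differ from the formulation used in the paper. The names `exp_fn`, `logistic`, `softmax`, and `gelu_tanh` are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List

# Type of a scalar exponential routine; math.exp is the exact default.
ExpFn = Callable[[float], float]


def logistic(x: float, exp_fn: ExpFn = math.exp) -> float:
    """Logistic function 1 / (1 + e^(-x)); one exponential per call."""
    # Note: with math.exp this overflows for x below roughly -709.
    return 1.0 / (1.0 + exp_fn(-x))


def softmax(xs: List[float], exp_fn: ExpFn = math.exp) -> List[float]:
    """Softmax with the usual max subtraction for numerical stability."""
    m = max(xs)
    es = [exp_fn(x - m) for x in xs]  # every argument is <= 0
    total = sum(es)
    return [e / total for e in es]


def gelu_tanh(x: float, exp_fn: ExpFn = math.exp) -> float:
    """Tanh-form GELU, using 0.5 * (1 + tanh(z)) == logistic(2 * z)."""
    z = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return x * logistic(2.0 * z, exp_fn)
```

Because all three functions take the exponential as a parameter, a faster approximate routine could be passed as `exp_fn` without changing the surrounding code.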

Abstract

As machine learning gets deployed more and more widely, and model sizes continue to grow, improving computational efficiency during model inference has become a key challenge. In many commonly used model architectures, including Transformers, a significant portion of the inference computation is comprised of exponential non-linearities such as Softmax. In this work, we develop QuAKE, a collection of novel operators that leverage certain properties of IEEE-754 floating point representations to quickly approximate the exponential function without requiring specialized hardware, extra memory, or precomputation. We propose optimizations that enhance the efficiency of QuAKE in commonly used exponential non-linearities such as Softmax, GELU, and the Logistic function. Our benchmarks demonstrate substantial inference speed improvements between 10% and 35% on server CPUs, and 5% and 45% on…
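For readers unfamiliar with the kind of IEEE-754 property the abstract alludes to, the sketch below shows the classic Schraudolph-style bit trick for a 32-bit float. It is not the QuAKE operators themselves, whose details are not given in this summary. The trick relies on the float32 exponent field starting at bit 23: writing the integer (x / ln 2 + 127) · 2^23 into the bit pattern yields roughly 2^(x / ln 2) = e^x, with the mantissa bits acting as a linear interpolation between powers of two. The names `approx_exp`, `_SCALE`, and `_BIAS` are hypothetical.

```python
import math
import struct

_SCALE = (1 << 23) / math.log(2.0)  # 2^23 / ln 2: maps x onto exponent-field units
_BIAS = 127 << 23                   # float32 exponent bias, shifted into place


def approx_exp(x: float) -> float:
    """Approximate e^x by building a float32 bit pattern directly.

    Illustrative only: uncorrected, this overestimates e^x by up to
    about 6% in relative terms.
    """
    # Clamp so the bit pattern stays a positive, finite, normal float32.
    x = max(-87.0, min(88.0, x))
    bits = int(_SCALE * x + _BIAS)
    # Reinterpret the 32-bit integer as an IEEE-754 single-precision float.
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

In pure Python this is slower than `math.exp`; the point is only to show the bit-level idea, which costs one multiply, one add, and a reinterpretation when written in a compiled kernel.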

Taxonomy

Topics: Gaussian Processes and Bayesian Inference

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax