Accelerated AI Inference via Dynamic Execution Methods

Haim Barad; Jascha Achterberg; Tien Pei Chou; Jean Yu

arXiv:2411.00853 · cs.LG · November 5, 2024

TL;DR

This paper explores dynamic execution techniques that adapt AI inference processes based on input complexity, significantly improving latency and throughput while reducing resource consumption in generative AI models.

Contribution

It introduces novel dynamic execution methods for generative AI, including adaptive early stopping and input-dependent sampling, integrated into popular AI libraries.
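As a rough illustration of input-dependent stopping in an iterative process, the sketch below stops a refinement loop once successive updates become small, so inputs that converge quickly use fewer steps. Everything here is an assumption made for illustration: the function name `refine_with_adaptive_steps`, the toy update rule, and the tolerance-based criterion are hypothetical stand-ins, not the paper's stopping method or any library's API.

```python
def refine_with_adaptive_steps(target, max_steps=50, tol=0.01):
    """Toy iterative refinement with a convergence-based early stop.

    Hypothetical stand-in for an iterative generator: each step moves the
    estimate halfway toward `target`. The loop ends as soon as the change
    between consecutive estimates drops below `tol`.
    Returns (final_estimate, steps_used).
    """
    x = 0.0
    for step in range(1, max_steps + 1):
        new_x = x + 0.5 * (target - x)   # one refinement step
        delta = abs(new_x - x)           # how much this step changed the estimate
        x = new_x
        if delta < tol:                  # input-dependent stopping point
            return x, step
    return x, max_steps                  # step budget exhausted


# A "hard" input (far from the start) needs more steps than an "easy" one.
_, hard_steps = refine_with_adaptive_steps(1.0)
_, easy_steps = refine_with_adaptive_steps(0.05)
print(hard_steps, easy_steps)
```

A fixed schedule would spend `max_steps` on both inputs; the adaptive loop spends compute only where the estimate is still changing.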

Findings

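01. Dynamic execution significantly improves inference latency and throughput.

02. Resource use drops without compromising output quality.

03. Combining dynamic execution with model-based optimizations such as quantization gives a multi-pronged strategy for optimizing inference.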

Abstract

In this paper, we focus on Dynamic Execution techniques that optimize the computation flow based on input. This aims to identify simpler problems that can be solved using fewer resources, similar to human cognition. The techniques discussed include early exit from deep networks, speculative sampling for language models, and adaptive steps for diffusion models. Experimental results demonstrate that these dynamic approaches can significantly improve latency and throughput without compromising quality. When combined with model-based optimizations, such as quantization, dynamic execution provides a powerful multi-pronged strategy to optimize AI inference. Generative AI requires a large amount of compute resources. This is expected to grow, and demand for resources in data centers through to the edge is expected to continue to increase at high rates. We take advantage of existing research…
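To make the early-exit idea concrete, here is a minimal sketch in plain Python. It assumes a network expressed as a list of layer functions, each paired with a hypothetical exit head that produces class logits, and it exits at the first depth where the softmax confidence reaches a threshold. The names `early_exit_predict`, `layers`, `exit_heads`, and `threshold`, along with the toy layers, are illustrative assumptions and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(x, layers, exit_heads, threshold=0.9):
    """Run layers in order and stop at the first confident exit head.

    Returns (predicted_class, confidence, layers_used). The last layer
    always produces an answer, so the full network is the fallback.
    """
    assert layers and len(layers) == len(exit_heads)
    h = x
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        h = layer(h)
        probs = softmax(head(h))
        conf = max(probs)
        if conf >= threshold or depth == len(layers):
            return probs.index(conf), conf, depth


# Toy network (assumption): each layer doubles the features, and each
# exit head treats the current features directly as logits.
layers = [lambda h: [2.0 * v for v in h]] * 6
exit_heads = [lambda h: h] * 6

easy = early_exit_predict([2.0, 0.0], layers, exit_heads)
hard = early_exit_predict([0.2, 0.0], layers, exit_heads)
print(easy[2], hard[2])  # layers used for the easy and the hard input
```

In this toy setup the clearly separated input leaves after the first layer, while the ambiguous input needs several layers before its confidence crosses the threshold. That is the trade the abstract describes: less compute on simpler inputs.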

Taxonomy

Topics: Advanced Malware Detection Techniques · Radiation Effects in Electronics · Adversarial Robustness in Machine Learning

Methods: Early Stopping · Diffusion · Focus