Accelerated AI Inference via Dynamic Execution Methods
Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu

TL;DR
This paper explores dynamic execution techniques that adapt AI inference processes based on input complexity, significantly improving latency and throughput while reducing resource consumption in generative AI models.
Contribution
It introduces novel dynamic execution methods for generative AI, including adaptive early stopping and input-dependent sampling, integrated into popular AI libraries.
Findings
Dynamic execution significantly improves inference latency and throughput.
Resource use drops without compromising output quality.
Combining dynamic execution with model-based optimizations such as quantization yields further gains.
Abstract
In this paper, we focus on Dynamic Execution techniques that optimize the computation flow based on input. This aims to identify simpler problems that can be solved using fewer resources, similar to human cognition. The techniques discussed include early exit from deep networks, speculative sampling for language models, and adaptive steps for diffusion models. Experimental results demonstrate that these dynamic approaches can significantly improve latency and throughput without compromising quality. When combined with model-based optimizations, such as quantization, dynamic execution provides a powerful multi-pronged strategy to optimize AI inference. Generative AI requires a large amount of compute resources. This is expected to grow, and demand for resources in data centers through to the edge is expected to continue to increase at high rates. We take advantage of existing research…
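As an illustration of the early-exit idea named in the abstract, here is a minimal sketch in plain Python. It is not the authors' implementation. The `stages` interface (a list of layer and exit-head function pairs), the softmax confidence test, and the 0.9 threshold are all assumptions chosen for the example. The point is only that an "easy" input can leave the network after fewer stages than a "hard" one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(x, stages, threshold=0.9):
    """Run stages in order and stop once an exit head is confident enough.

    `stages` is a hypothetical interface: a list of (layer_fn, exit_head_fn)
    pairs, where layer_fn maps features to features and exit_head_fn maps
    features to class logits. Returns (predicted_class, confidence, stages_used).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    h = x
    for used, (layer, head) in enumerate(stages, start=1):
        h = layer(h)
        probs = softmax(head(h))
        confidence = max(probs)
        # Exit early when confident; the last stage always returns.
        if confidence >= threshold or used == len(stages):
            return probs.index(confidence), confidence, used


def make_toy_stage(scale):
    """Toy stage (assumption): the layer scales features, the head is identity."""
    def layer(h):
        return [v * scale for v in h]

    def head(h):
        return h

    return layer, head


stages = [make_toy_stage(2.0) for _ in range(4)]
easy_input = [3.0, 0.0]   # large margin: confident after the first stage
hard_input = [0.2, 0.0]   # small margin: needs every stage

easy_result = early_exit_predict(easy_input, stages)
hard_result = early_exit_predict(hard_input, stages)
print("easy input used", easy_result[2], "stage(s)")
print("hard input used", hard_result[2], "stage(s)")
```

In this toy setup the compute spent scales with input difficulty. A real model would attach trained exit heads to intermediate layers and tune the threshold against a quality target.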
Taxonomy
Topics: Advanced Malware Detection Techniques · Radiation Effects in Electronics · Adversarial Robustness in Machine Learning
Methods: Early Stopping · Diffusion · Focus
