Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling
Rajeev Patwari, Ashish Sirasao, Devleena Das

TL;DR
LIFE is a hardware-agnostic analytical framework that accurately forecasts LLM inference performance across diverse devices by modeling operator behavior and optimizations using only hardware specifications.
Contribution
The paper introduces LIFE, a modular analytical model that predicts LLM inference performance without extensive benchmarking, accommodating hardware and software optimizations.
Findings
LIFE accurately forecasts inference metrics on various hardware platforms.
The framework accounts for software optimizations like quantization and operator fusion.
Validation shows LIFE's predictions align well with real-world inference performance.
Abstract
Large language models (LLMs) are increasingly deployed as local agents on personal devices with CPUs, NPUs, and integrated GPUs. However, forecasting inference performance on such heterogeneous devices remains challenging because compute and memory demands vary dynamically. Existing approaches rely on GPU benchmarking or machine learning-based latency predictors, which are often hardware-specific and generalize poorly. To this end, we introduce LIFE, a lightweight and modular analytical framework composed of analytical models of operators, configurable to characterize LLM inference workloads in a hardware- and dataset-agnostic manner. LIFE characterizes the influence of software and model optimizations, such as quantization, KV cache compression, LoRA adapters, chunked prefill, different attention variants, and operator fusion, on performance metrics such as…
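To make the idea of forecasting operator performance from hardware specifications concrete, here is a minimal roofline-style sketch. It is not the LIFE model itself: the `Hardware` class, the function names, and the peak-compute and bandwidth figures are all hypothetical, and the cost formula (latency as the larger of compute time and memory-transfer time) is a generic textbook estimate used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    """Hypothetical device description built from spec-sheet numbers only."""
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s


def linear_op_cost(m, k, n, weight_bytes, act_bytes=2.0):
    """FLOPs and bytes moved for an (m x k) @ (k x n) linear layer.

    weight_bytes is the storage size per weight element, so 2.0 models
    16-bit weights and 0.5 models 4-bit quantized weights (an assumption
    that ignores scales, zero points, and dequantization overhead).
    """
    flops = 2.0 * m * k * n
    bytes_moved = k * n * weight_bytes + (m * k + m * n) * act_bytes
    return flops, bytes_moved


def roofline_latency(flops, bytes_moved, hw):
    """Estimated latency in seconds: the slower of compute and memory traffic."""
    compute_time = flops / hw.peak_flops
    memory_time = bytes_moved / hw.mem_bandwidth
    return max(compute_time, memory_time)


# Illustrative device: 10 TFLOP/s peak compute, 100 GB/s memory bandwidth.
hw = Hardware(peak_flops=10e12, mem_bandwidth=100e9)

# Decode step (one token, m=1) through a 4096 x 4096 projection.
decode_fp16 = roofline_latency(*linear_op_cost(1, 4096, 4096, 2.0), hw)
decode_int4 = roofline_latency(*linear_op_cost(1, 4096, 4096, 0.5), hw)

# Prefill over 512 tokens through the same projection.
prefill_fp16 = roofline_latency(*linear_op_cost(512, 4096, 4096, 2.0), hw)

print(f"decode fp16:  {decode_fp16 * 1e6:.1f} us")
print(f"decode int4:  {decode_int4 * 1e6:.1f} us")
print(f"prefill fp16: {prefill_fp16 * 1e6:.1f} us")
```

Under these assumed numbers the single-token decode step is limited by memory traffic, so shrinking the weights with quantization shortens the estimate, while the 512-token prefill is limited by compute. This is the kind of effect an analytical model can expose without running a benchmark.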
Taxonomy
Topics: Big Data and Digital Economy · Parallel Computing and Optimization Techniques · Natural Language Processing Techniques
