Memory Access Characterization of Large Language Models in CPU Environment and its Potential Impacts
Spencer Banasik

TL;DR
This paper investigates how cache architecture modifications can enhance large language model inference speed on CPUs by analyzing memory access patterns and performance in CPU-only environments.
Contribution
It provides a detailed analysis of memory access patterns and proposes potential cache optimizations for improving LLM inference on CPUs.
Findings
Identified key memory access bottlenecks in LLM inference on CPUs.
Evaluated various cache configurations for performance improvements.
Provided insights into memory footprint patterns of LLMs.
Abstract
As machine learning algorithms prove increasingly valuable, demand for access to them has grown accordingly. Running inference with larger models is often infeasible without an accelerator, which may be unavailable in environments constrained by energy consumption, security, or cost. To make these models more widely available, we aim to improve LLM inference speed in a CPU-only environment by modifying the cache architecture. To determine what improvements could be made, we conducted two experiments using llama.cpp and the Qwen model: running inference under various cache configurations and evaluating their performance, and recording a trace of the memory footprint. Using these experiments, we investigate the memory access patterns and performance characteristics to identify potential optimizations.
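To make the two experiments concrete, here is a minimal sketch of how a memory trace can be replayed against different cache configurations. It is not the paper's tooling: the set-associative LRU model, the function names `simulate_cache` and `streaming_trace`, the 64-byte line size, the 4-way associativity, and the synthetic sequential trace (a rough stand-in for repeatedly streaming a weight buffer) are all assumptions chosen for illustration.

```python
from collections import OrderedDict


def simulate_cache(trace, size_bytes, line_bytes=64, ways=4):
    """Replay byte addresses through a set-associative LRU cache.

    Returns the fraction of accesses that hit. Hypothetical toy model,
    not the simulator used in the paper.
    """
    num_sets = size_bytes // (line_bytes * ways)
    # One OrderedDict per set; insertion order doubles as LRU order.
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in trace:
        line = addr // line_bytes
        cache_set = sets[line % num_sets]
        tag = line // num_sets
        if tag in cache_set:
            hits += 1
            cache_set.move_to_end(tag)  # mark as most recently used
        else:
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[tag] = True
    return hits / len(trace) if trace else 0.0


def streaming_trace(num_bytes, stride=4, passes=2):
    """Synthetic trace: read a buffer sequentially `passes` times.

    Assumed stand-in for a real trace captured during inference.
    """
    return [a for _ in range(passes) for a in range(0, num_bytes, stride)]


if __name__ == "__main__":
    trace = streaming_trace(64 * 1024)
    for size_kib in (16, 32, 64, 128):
        rate = simulate_cache(trace, size_kib * 1024)
        print(f"{size_kib:4d} KiB cache: hit rate {rate:.4f}")
```

In this toy setup the hit rate only improves once the cache is large enough to hold the whole buffer, because a sequential sweep evicts each line under LRU before it is reused. A real trace from an inference run would be needed to draw any conclusion about actual LLM workloads.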
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Topic Modeling
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
