Memory Access Characterization of Large Language Models in CPU Environment and its Potential Impacts

Spencer Banasik

arXiv:2506.01827·cs.LG·June 3, 2025

Open Access

TL;DR

This paper investigates how cache architecture modifications can enhance large language model inference speed on CPUs by analyzing memory access patterns and performance in CPU-only environments.

Contribution

It provides a detailed analysis of memory access patterns and proposes potential cache optimizations for improving LLM inference on CPUs.

Findings

01. Identified key memory access bottlenecks in LLM inference on CPUs.
02. Evaluated various cache configurations for performance improvements.
03. Provided insights into memory footprint patterns of LLMs.

Abstract

As machine learning algorithms prove increasingly valuable, demand for access to them has grown accordingly. It is often infeasible to run inference with larger models without an accelerator, which may be unavailable in environments constrained by energy consumption, security, or cost. To increase the availability of these models, we aim to improve LLM inference speed in a CPU-only environment by modifying the cache architecture. To determine what improvements could be made, we conducted two experiments using Llama.cpp and the QWEN model: running various cache configurations and evaluating their performance, and recording a trace of the memory footprint. Using these experiments, we investigate the memory access patterns and performance characteristics to identify potential optimizations.
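To make the idea of evaluating cache configurations against a memory trace concrete, here is a minimal sketch of a trace-driven cache model. It is not the paper's tooling: the function names `simulate_cache` and `streaming_trace`, the LRU replacement policy, the 64-byte line size, the 4-way associativity, and the synthetic trace (repeated sequential sweeps over one buffer, loosely standing in for re-reading model weights) are all assumptions chosen for illustration.

```python
from collections import OrderedDict


def simulate_cache(trace, cache_bytes, line_bytes=64, ways=4):
    """Replay byte addresses through a set-associative LRU cache; return the hit rate.

    Hypothetical helper for illustration; not taken from the paper.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    # One OrderedDict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in trace:
        line = addr // line_bytes          # cache-line index of this address
        cache_set = sets[line % num_sets]  # set the line maps to
        if line in cache_set:
            cache_set.move_to_end(line)    # mark as most recently used
            hits += 1
        else:
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used line
            cache_set[line] = True
    return hits / len(trace)


def streaming_trace(num_bytes, passes, stride=4):
    """Synthetic trace (an assumption): sweep one buffer sequentially, several times."""
    return [a for _ in range(passes) for a in range(0, num_bytes, stride)]


# Sweep a 256 KiB buffer three times and compare three cache sizes.
trace = streaming_trace(num_bytes=256 * 1024, passes=3)
for size_kib in (32, 128, 512):
    rate = simulate_cache(trace, cache_bytes=size_kib * 1024)
    print(f"{size_kib:>4} KiB cache: hit rate {rate:.4f}")
```

In this toy setup, the hit rate only improves once the cache is large enough to hold the whole buffer; smaller caches evict each line before the next sweep reaches it, so they benefit only from spatial locality within a line. Real traces from an inference run would show more varied behavior than this synthetic sweep.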

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings