Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64
Bugra Kilictas, Faruk Alpay

TL;DR
This paper introduces a software-based tensor virtualization architecture optimized for ARM64 that significantly reduces memory bottlenecks in edge-AI inference, enabling efficient LLM deployment on devices like Apple Silicon.
Contribution
It presents a novel Virtual Tensor Core architecture with a custom tensor layout and zero-copy loader, improving memory utilization and inference throughput on ARM64 edge devices.
Findings
Achieves >60 tokens/sec on a 110M-parameter model on M2 hardware
Guarantees 100% cache line utilization for weight matrices
Provides a portable, deterministic implementation for studying memory bottlenecks
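To make the cache-line claim concrete, the following C sketch computes what fraction of the fetched cache-line bytes is actually used when one weight row is read. It is not the paper's Tensor Virtualization Layout. The 64-byte line size, the helper names `line_utilization` and `padded_stride`, and the 768-float row used in the note below are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size in bytes (hypothetical default; the real value
 * depends on the target core and should be queried on the device). */
#define CACHE_LINE_BYTES 64u

/* Fraction of fetched cache-line bytes that are useful when reading
 * `nbytes` bytes starting at byte address offset `offset`. */
static double line_utilization(size_t offset, size_t nbytes, size_t line) {
    assert(line > 0);
    if (nbytes == 0) return 0.0;
    size_t first = offset / line;                 /* first line touched */
    size_t last = (offset + nbytes - 1) / line;   /* last line touched  */
    size_t fetched = (last - first + 1) * line;   /* bytes pulled in    */
    return (double)nbytes / (double)fetched;
}

/* Round a row of `cols` floats up to a whole number of cache lines.
 * The result is a row stride measured in floats. */
static size_t padded_stride(size_t cols) {
    size_t per_line = CACHE_LINE_BYTES / sizeof(float);
    return (cols + per_line - 1) / per_line * per_line;
}
```

Under these assumptions, a row of 768 floats (3072 bytes) that starts on a line boundary touches exactly 48 lines and uses every byte of them, while the same row shifted by 4 bytes touches 49 lines and wastes part of the first and last. Full utilization therefore needs both an aligned start and a row length that is a multiple of the line size.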
Abstract
The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall," the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high-level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel "Virtual Tensor Core" architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand-tuned NEON SIMD kernels, we achieve a form of "Software-Defined Direct Memory Access (DMA)." Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero-copy loader eliminates initialization latency. Experimental results on a 110M…
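The two mechanisms named in the abstract, memory-mapped weights and NEON kernels, can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `map_weights` and `dot_f32` are hypothetical, the weight file is assumed to be a flat array of 32-bit floats with no header, and the NEON path is compiled only on AArch64, with a scalar loop as the fallback elsewhere.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Map a weight file read-only and return a pointer into the page cache.
 * No bytes are copied into a heap buffer. Returns NULL on failure.
 * Assumption: the file is a raw array of float32 values. */
static const float *map_weights(const char *path, size_t *n_floats) {
    assert(path != NULL && n_floats != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED) return NULL;
    *n_floats = (size_t)st.st_size / sizeof(float);
    return (const float *)p;
}

/* Dot product of two float arrays. Uses 4-wide fused multiply-add on
 * AArch64; the scalar loop handles the tail and non-ARM builds. */
static float dot_f32(const float *a, const float *b, size_t n) {
    size_t i = 0;
    float sum = 0.0f;
#if defined(__aarch64__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4)
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    sum = vaddvq_f32(acc); /* horizontal add of the four lanes */
#endif
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

A caller would map the file once, then pass row pointers from the mapping straight to `dot_f32` for each matrix-vector product. Because `mmap` is lazy, pages are read from storage on first touch, so the load cost moves from startup into the first pass over the weights rather than disappearing.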
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy
