Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64

Bugra Kilictas; Faruk Alpay

arXiv:2601.03324 · cs.CL · January 8, 2026

Open Access

TL;DR

This paper introduces a software-based tensor virtualization architecture optimized for ARM64 that significantly reduces memory bottlenecks in edge-AI inference, enabling efficient LLM deployment on devices like Apple Silicon.

Contribution

It presents a novel Virtual Tensor Core architecture with a custom tensor layout and zero-copy loader, improving memory utilization and inference throughput on ARM64 edge devices.

Findings

01. Achieves >60 tokens/sec on a 110M-parameter model on M2 hardware

02. Guarantees 100% cache line utilization for weight matrices

03. Provides a portable, deterministic implementation for studying memory bottlenecks
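
To make the "cache line utilization" finding concrete, the following is a minimal sketch that counts how many of the bytes fetched in whole cache lines are actually consumed by a strided read. It is not the paper's Tensor Virtualization Layout, which this summary does not describe; the function name `cache_line_utilization` and all of its parameters are hypothetical, and the line size is passed in because it differs between cores (64 and 128 bytes are typical values).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: fraction of fetched cache-line bytes that are consumed
 * when reading n elements of elem_bytes each, spaced stride_bytes apart,
 * starting at byte offset base. Assumes stride_bytes >= elem_bytes, so
 * addresses increase monotonically and each line is counted once. */
double cache_line_utilization(size_t base, size_t n, size_t elem_bytes,
                              size_t stride_bytes, size_t line_bytes) {
    assert(n > 0 && elem_bytes > 0 && line_bytes > 0);
    assert(stride_bytes >= elem_bytes);

    size_t lines = 0;
    size_t last_line = (size_t)-1; /* sentinel: no line touched yet */

    for (size_t i = 0; i < n; ++i) {
        size_t start = base + i * stride_bytes;
        size_t first = start / line_bytes;
        size_t last = (start + elem_bytes - 1) / line_bytes;
        /* An element may straddle two lines if it is not aligned. */
        for (size_t l = first; l <= last; ++l) {
            if (l != last_line) {
                ++lines;
                last_line = l;
            }
        }
    }
    return (double)(n * elem_bytes) / (double)(lines * line_bytes);
}
```

Under these assumptions, reading 16 contiguous, aligned float32 values with 64-byte lines uses every fetched byte, whereas walking down a column of a row-major matrix with 1024 float32 values per row (a 4096-byte stride) uses only 4 of every 64 bytes fetched. A layout that keeps each kernel's reads contiguous and line-aligned is one way to reach full utilization.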

Abstract

The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall," the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high-level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel "Virtual Tensor Core" architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand-tuned NEON SIMD kernels, we achieve a form of "Software-Defined Direct Memory Access (DMA)." Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero-copy loader eliminates initialization latency. Experimental results on a 110M…
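
The two mechanisms named in the abstract, memory-mapped weight loading and NEON kernels, can be illustrated with a minimal sketch. This is not the authors' implementation: the `weight_view` struct, the helper names, and the raw float32 file format are all assumptions made for illustration. The NEON path is compiled only on AArch64, with a scalar loop used elsewhere and for the tail.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical read-only view over a raw float32 weight file. */
typedef struct {
    const float *data; /* points directly into the mapped file */
    size_t count;      /* number of float32 values */
    size_t bytes;      /* mapping length, needed for munmap */
} weight_view;

/* Demo helper (assumed file format: raw float32 values, no header). */
int write_f32_file(const char *path, const float *v, size_t n) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t written = fwrite(v, sizeof(float), n, f);
    return (fclose(f) == 0 && written == n) ? 0 : -1;
}

/* Map the file read-only; no bytes are copied into a heap buffer. */
int weights_open(const char *path, weight_view *out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) return -1;
    out->data = (const float *)p;
    out->bytes = (size_t)st.st_size;
    out->count = out->bytes / sizeof(float);
    return 0;
}

void weights_close(weight_view *w) {
    munmap((void *)w->data, w->bytes);
    w->data = NULL;
    w->count = 0;
    w->bytes = 0;
}

/* Dot product over n floats: 4-wide fused multiply-add on AArch64 NEON,
 * scalar loop for the remainder and on other architectures. */
float dot_f32(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    size_t i = 0;
    float sum = 0.0f;
#if defined(__aarch64__) && defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4)
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    sum = vaddvq_f32(acc); /* horizontal add of the four lanes */
#endif
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

In this sketch the kernel reads weights straight from the mapping returned by `weights_open`, so pages are faulted in by the operating system on first touch instead of being read and copied up front. That is the general sense in which an mmap-based loader avoids initialization work; the paper's actual loader and kernels may differ.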

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy