The Immutable Tensor Architecture: A Pure Dataflow Approach for Secure, Energy-Efficient AI Inference
Fang Li

TL;DR
The paper introduces the Immutable Tensor Architecture, a novel hardware paradigm that encodes model weights directly into ASIC interconnects, eliminating memory bottlenecks for energy-efficient AI inference on edge devices.
Contribution
It proposes a new hardware architecture that treats model weights as physical circuit topology, removing the need for traditional memory hierarchies in AI inference.
Findings
Eliminates memory hierarchy to reduce energy consumption.
Enables secure, energy-efficient LLM deployment on edge devices.
Uses a 'Split-Brain' design in which a host CPU manages the dynamic KV-cache while the ASIC serves as a stateless, ROM-embedded dataflow engine.
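To put a rough number on the memory traffic that the first finding refers to, the sketch below estimates the DRAM bandwidth needed when every weight is read once per generated token (batch size 1, dense model). The function name, the 7-billion-parameter model size, and the 10 tokens/s rate are illustrative assumptions and are not figures from the paper.

```python
def weight_traffic_gb_per_s(num_params, bytes_per_param, tokens_per_s):
    """Bandwidth (GB/s) if every weight is fetched from DRAM once per token."""
    bytes_per_token = num_params * bytes_per_param
    return bytes_per_token * tokens_per_s / 1e9


if __name__ == "__main__":
    # Hypothetical workload, chosen only for illustration (not from the paper).
    params = 7_000_000_000
    tokens_per_s = 10
    for label, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
        gbps = weight_traffic_gb_per_s(params, bytes_per_param, tokens_per_s)
        print(f"{label}: {gbps:.1f} GB/s of weight traffic")
```

Under these assumptions the weight traffic scales linearly with model size, precision, and token rate. That recurring fetch is what a design with weights fixed in the circuit would avoid.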
Abstract
The deployment of Large Language Models (LLMs) on consumer edge devices is throttled by the "Memory Wall": the prohibitive bandwidth and energy cost of fetching gigabytes of model weights from DRAM for every token generated. Current architectures (GPUs, NPUs) treat model weights as mutable software data, incurring massive energy penalties to maintain general-purpose programmability. We propose the Immutable Tensor Architecture (ITA), a paradigm shift that treats model weights not as data, but as physical circuit topology. By encoding parameters directly into the metal interconnects and logic of mature-node ASICs (28nm/40nm), ITA eliminates the memory hierarchy entirely. We present a "Split-Brain" system design where a host CPU manages dynamic KV-cache operations while the ITA ASIC acts as a stateless, ROM-embedded dataflow engine.
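The division of labor in the "Split-Brain" design can be sketched in software as a toy model. In the sketch below, module-level constant tuples stand in for weights fixed in silicon, a pure function stands in for the stateless ASIC, and a small class holds the host-side KV-cache. All names (`ita_project`, `HostKVCache`, `decode_step`), the 2x2 weight values, and the choice to run the attention step on the host are assumptions made for illustration; the abstract does not specify the interface between host and ASIC.

```python
import math

# Constant tuples stand in for parameters fixed in hardware (toy values, hypothetical).
W_Q = ((1.0, 0.0), (0.0, 1.0))
W_K = ((1.0, 0.0), (0.0, 1.0))
W_V = ((0.5, 0.0), (0.0, 0.5))


def matvec(w, x):
    """Multiply a small matrix (tuple of rows) by a vector (list of floats)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def ita_project(x):
    """Stateless 'ASIC' step: a pure function of its input with fixed weights."""
    return matvec(W_Q, x), matvec(W_K, x), matvec(W_V, x)


class HostKVCache:
    """Host-side mutable state: one key/value pair is appended per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def attend(self, q, k, v):
        # Store the new key/value, then attend over everything cached so far.
        self.keys.append(k)
        self.values.append(v)
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in self.keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        return [
            sum(w * val[i] for w, val in zip(weights, self.values))
            for i in range(len(v))
        ]


def decode_step(cache, x):
    """One token: fixed-weight projections, then host-managed attention."""
    q, k, v = ita_project(x)
    return cache.attend(q, k, v)


if __name__ == "__main__":
    cache = HostKVCache()
    for token_embedding in ([1.0, 0.0], [0.0, 1.0]):
        print(decode_step(cache, token_embedding))
```

In this sketch the only state that changes between tokens lives in `HostKVCache`; `ita_project` returns the same output for the same input every time, which is the property the abstract attributes to the ASIC.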
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Security and Verification in Computing
