Tensor Manipulation Unit (TMU): Reconfigurable, Near-Memory Tensor Manipulation for High-Throughput AI SoC

Weiyu Zhou; Zheng Wang; Chao Chen; Yike Li; Yongkui Yang; Zhuoyu Wu; Anupam Chattopadhyay

arXiv:2506.14364 · cs.AR · June 18, 2025

TL;DR

The paper introduces the Tensor Manipulation Unit (TMU), a reconfigurable near-memory hardware block that significantly accelerates tensor data movement and manipulation, reducing latency and improving pipeline efficiency in AI systems.

Contribution

It presents the design and integration of a reconfigurable TMU for efficient tensor data movement, supporting diverse operators and enhancing AI SoC performance.

Findings

01. Achieves up to 1413x latency reduction over an ARM A72.

02. Supports over 10 tensor manipulation operators.

03. Reduces end-to-end inference latency by 34.6% when integrated with a TPU.
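
The findings mix two ways of reporting latency: a multiplicative factor (1413x) and a percentage reduction (34.6%). The short sketch below converts between the two so the figures can be compared on one scale. It assumes that "Nx latency reduction" means old latency divided by new latency; the function names are illustrative and do not come from the paper.

```python
def speedup_from_reduction(reduction_percent):
    """Return old_latency / new_latency for a percentage latency reduction."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_percent / 100.0)


def reduction_from_speedup(speedup):
    """Return the percentage latency reduction for a given speedup factor."""
    if speedup < 1:
        raise ValueError("speedup must be at least 1")
    return 100.0 * (1.0 - 1.0 / speedup)


# A 34.6% end-to-end reduction corresponds to roughly a 1.53x speedup.
print(round(speedup_from_reduction(34.6), 2))

# Assumption: 1413x is a latency ratio, which is roughly a 99.93% reduction.
print(round(reduction_from_speedup(1413), 2))
```

Under that reading, the operator-level gain over the CPU baseline is far larger than the end-to-end gain, which is the expected pattern when tensor manipulation accounts for only part of total inference time.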

Abstract

While recent advances in AI SoC design have focused heavily on accelerating tensor computation, the equally critical task of tensor manipulation, centered on high-volume data movement with minimal computation, remains underexplored. This work addresses that gap by introducing the Tensor Manipulation Unit (TMU), a reconfigurable, near-memory hardware block designed to efficiently execute data-movement-intensive operators. TMU manipulates long datastreams in a memory-to-memory fashion using a RISC-inspired execution model and a unified addressing abstraction, enabling broad support for both coarse- and fine-grained tensor transformations. Integrated alongside a TPU within a high-throughput AI SoC, the TMU leverages double buffering and output forwarding to improve pipeline utilization. Fabricated in SMIC 40nm technology, the TMU occupies only 0.019 mm² while supporting over 10…
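
The idea of memory-to-memory manipulation behind a single addressing abstraction can be illustrated with a minimal software sketch. The abstract does not specify the TMU's actual addressing scheme, so the base-plus-stride formulation below is an assumption chosen for illustration, and `strided_copy` and its parameters are hypothetical names. The point is only that one copy routine, reconfigured with different address parameters, can express several data-movement operators.

```python
from itertools import product


def strided_copy(src, dst, shape, src_strides, dst_strides,
                 src_base=0, dst_base=0):
    """Copy a tensor between two flat buffers, one element at a time.

    Source and destination addresses are both derived from the same
    multi-dimensional index, using a base offset plus per-axis strides.
    This is an illustrative model, not the TMU's documented behavior.
    """
    for idx in product(*(range(n) for n in shape)):
        s = src_base + sum(i * st for i, st in zip(idx, src_strides))
        d = dst_base + sum(i * st for i, st in zip(idx, dst_strides))
        dst[d] = src[s]


# A 2x3 tensor stored row-major in flat memory: [[0, 1, 2], [3, 4, 5]].
src = [0, 1, 2, 3, 4, 5]

# Transpose: element (i, j) moves to (j, i) of a 3x2 row-major result.
transposed = [None] * 6
strided_copy(src, transposed, shape=(2, 3),
             src_strides=(3, 1), dst_strides=(1, 2))
print(transposed)  # [0, 3, 1, 4, 2, 5]

# Flip along the last axis: a negative stride with a shifted base.
flipped = [None] * 6
strided_copy(src, flipped, shape=(2, 3),
             src_strides=(3, 1), dst_strides=(3, -1), dst_base=2)
print(flipped)  # [2, 1, 0, 5, 4, 3]
```

Both operators reuse the same loop and differ only in configuration, which is the software analogue of a reconfigurable block that moves data without performing arithmetic on it.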

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Advanced Neural Network Applications