CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead

Jinpeng Ye; Chongxi Wang; Wenqing Li; Bin Yuan; Shiyi Wang; Fenglu Zhang; Junyu Yue; Jianan Xie; Yunhao Ye; Haoyu Deng; Yingkun Zhou; Xin Cheng; Fuxin Zhang; Jian Wang

arXiv:2604.11615·cs.AR·April 14, 2026

TL;DR

This paper introduces a unified, configurable matrix extension for CPUs that reduces design overhead, improves AI workload performance, and adapts to diverse architectures. The implementation is open source.

Contribution

The paper presents a flexible matrix extension architecture, decoupled from the CPU pipeline, that supports mixed-precision operations and an asynchronous matrix multiplication abstraction. The design is integrated into open-source CPU platforms.

Findings

01. Matrix unit utilization exceeds 90% on GEMM workloads.

02. Achieves up to 2.31x speedup on AI models such as Llama3.

03. The design occupies only 0.53 mm² in 14nm CMOS.

Abstract

Matrix extensions have emerged as an essential feature in modern CPUs to address the surging demands of AI workloads. However, existing designs often incur substantial hardware and software design overhead. Tight coupling with the CPU pipeline complicates integration across diverse CPUs, while fine-grained synchronous instructions hinder the development of high-performance kernels. This paper proposes a unified and configurable CPU matrix extension architecture. By decoupling matrix units from the CPU pipeline, the design enables low-overhead integration while maintaining close coordination with existing compute and memory resources. The configurable matrix unit supports mixed-precision operations and adapts to diverse compute demands and memory bandwidth constraints. An asynchronous matrix multiplication abstraction with flexible granularity conceals hardware details, simplifies…
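
The idea of an asynchronous matrix multiplication abstraction with flexible granularity can be illustrated with a small software analogy. The sketch below is not the paper's ISA or programming interface: the names `matmul_tile`, `async_matmul`, `wait_all`, and `run_demo` are hypothetical, and a thread pool merely stands in for a matrix unit that runs decoupled from the CPU pipeline. It shows the caller issuing work at a chosen tile size, continuing with unrelated work, and synchronizing only when the result is needed.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_tile(a, b, row0, row1):
    """Compute rows [row0, row1) of a @ b (stand-in for the matrix unit)."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(row0, row1)
    ]


def async_matmul(pool, a, b, tile_rows):
    """Issue one asynchronous task per row tile and return the handles.

    tile_rows is the (hypothetical) granularity knob: larger tiles mean
    fewer, coarser requests; smaller tiles mean more, finer requests.
    """
    return [
        pool.submit(matmul_tile, a, b, r, min(r + tile_rows, len(a)))
        for r in range(0, len(a), tile_rows)
    ]


def wait_all(handles):
    """Block until every tile has finished, then stitch the rows together."""
    result = []
    for handle in handles:
        result.extend(handle.result())
    return result


def run_demo(tile_rows=2):
    a = [[1, 2], [3, 4], [5, 6]]
    b = [[7, 8], [9, 10]]
    # One worker models a single matrix unit separate from the CPU.
    with ThreadPoolExecutor(max_workers=1) as pool:
        handles = async_matmul(pool, a, b, tile_rows)
        other_work = sum(range(10))  # the CPU proceeds with unrelated work
        c = wait_all(handles)        # synchronize only when C is needed
    return c, other_work


if __name__ == "__main__":
    print(run_demo())
```

The result does not depend on `tile_rows`; only the number of issued requests changes, which is the sense in which the granularity is hidden from the correctness of the kernel in this toy model.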
