QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives

Xuzhi Zhang; Shaohui Peng; Qirui Zhou; Yuanbo Wen; Qi Guo; Ruizhi Chen; Xinguo Zhu; Weiqiang Xiong; Haixin Chen; Congying Ma; Ke Gao; Chen Zhao; Yanjun Wu; Yunji Chen; and Ling Li

arXiv:2505.06302 · cs.LG · May 13, 2025

TL;DR

QiMeng-TensorOp is a framework that enables large language models to automatically generate high-performance tensor operators tailored to diverse hardware architectures, significantly improving efficiency and reducing development costs.

Contribution

This work introduces a novel auto-generation framework that leverages LLMs to produce optimized tensor operators with hardware primitives across various architectures.

Findings

01

Achieves up to 1291x performance improvement over vanilla LLMs.

02

Reaches 251% of OpenBLAS performance on RISC-V CPUs.

03

Reduces development costs by 200x compared to human experts.

Abstract

Computation-intensive tensor operators account for over 90% of the computation in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures such as RISC-V, ARM, and GPUs, because manually optimized implementations take at least months to develop and lack portability. LLMs excel at generating code in high-level languages, but they struggle to fully comprehend hardware characteristics and to produce high-performance tensor operators. We introduce QiMeng-TensorOp, a tensor-operator auto-generation framework driven by a one-line user prompt, which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives and to tune parameters for optimal performance across diverse hardware. Experimental results on…

Taxonomy

Topics: Tensor decomposition and applications · Parallel Computing and Optimization Techniques · Machine Learning in Materials Science