The Cambrian Explosion of Mixed-Precision Matrix Multiplication for Quantized Deep Learning Inference

Héctor Martínez; Adrián Castelló; Francisco D. Igual; Enrique S. Quintana-Ortí

arXiv:2506.11728·cs.CL·June 16, 2025

TL;DR

This paper traces how the optimization of general matrix multiplication (GEMM) has evolved for mixed-precision integer arithmetic in deep learning inference, and demonstrates significant performance improvements on modern CPU architectures.

Contribution

The paper introduces novel micro-kernel designs and data layouts for mixed-precision integer GEMM, adapting traditional high-performance GEMM techniques to modern hardware for DL inference.
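For readers unfamiliar with the "traditional high-performance methods" mentioned above, here is a minimal sketch, assuming they refer to the packing plus micro-kernel structure popularized by GotoBLAS and BLIS. It is a simplified illustration, not the paper's micro-kernels or data layouts: the tile sizes `MR`, `NR`, `KC` and the helpers `pack_a`, `pack_b`, `micro_kernel`, and `gemm_int8` are hypothetical names chosen here. Operands hold int8 values and the micro-kernel accumulates into an int32-style tile (Python integers are unbounded, so overflow is not modeled).

```python
# Simplified BLIS-style blocked GEMM for int8 operands with int32-style accumulation.
# MR, NR, KC and all function names are hypothetical illustration choices.

MR, NR, KC = 4, 4, 8  # micro-tile rows, micro-tile columns, depth of one k-block

def pack_a(A, i0, k0, kc):
    """Pack an MR x kc panel of A so each k-step reads MR contiguous values.
    Rows beyond the end of A are zero-padded."""
    m = len(A)
    panel = []
    for p in range(kc):
        for i in range(MR):
            r = i0 + i
            panel.append(A[r][k0 + p] if r < m else 0)
    return panel

def pack_b(B, k0, j0, kc):
    """Pack a kc x NR panel of B so each k-step reads NR contiguous values.
    Columns beyond the end of B are zero-padded."""
    n = len(B[0])
    panel = []
    for p in range(kc):
        for j in range(NR):
            c = j0 + j
            panel.append(B[k0 + p][c] if c < n else 0)
    return panel

def micro_kernel(a_panel, b_panel, kc, acc):
    """Apply kc rank-1 updates to an MR x NR accumulator tile
    (on real hardware this tile would live in vector registers)."""
    for p in range(kc):
        a = a_panel[p * MR:(p + 1) * MR]
        b = b_panel[p * NR:(p + 1) * NR]
        for i in range(MR):
            for j in range(NR):
                acc[i][j] += a[i] * b[j]  # int8 x int8 product, wide accumulation

def gemm_int8(A, B):
    """C = A @ B for A (m x k) and B (k x n) holding int8 values."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for k0 in range(0, k, KC):
        kc = min(KC, k - k0)
        for j0 in range(0, n, NR):
            b_panel = pack_b(B, k0, j0, kc)
            for i0 in range(0, m, MR):
                a_panel = pack_a(A, i0, k0, kc)
                acc = [[0] * NR for _ in range(MR)]
                micro_kernel(a_panel, b_panel, kc, acc)
                # Write back only the valid part of a possibly padded tile.
                for i in range(min(MR, m - i0)):
                    for j in range(min(NR, n - j0)):
                        C[i0 + i][j0 + j] += acc[i][j]
    return C
```

Zero padding during packing lets the micro-kernel always work on a full `MR` x `NR` tile, so edge cases are confined to the packing and write-back code. For example, `gemm_int8([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`.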

Findings

01. Mixed-precision integer (MIP) arithmetic outperforms floating-point arithmetic in GEMM on modern CPUs.

02. New micro-kernels effectively exploit specialized hardware features.

03. Significant performance gains are demonstrated across x86_64, ARM, and RISC-V architectures.
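To make concrete what mixed-precision integer arithmetic means in quantized inference, here is a minimal sketch, assuming a simple symmetric per-tensor int8 quantizer with round-to-nearest. That quantizer is an illustrative choice, not necessarily the scheme used in the paper, and `quantize`, `int_gemm`, and `quantized_matmul` are hypothetical names. The sketch also works through why the accumulator must be wider than the operands: the largest int8 product is (-128)·(-128) = 16384, so two such terms already exceed the int16 maximum of 32767, while an int32 accumulator can hold 131071 worst-case terms.

```python
# Symmetric, per-tensor int8 quantization wrapped around an integer GEMM.
# Hypothetical helper names; the quantizer is an assumed example scheme.

INT8_MAX = 127

def quantize(X):
    """Return (Q, scale) where Q holds int8 values in [-127, 127]."""
    amax = max(abs(v) for row in X for v in row) or 1.0  # avoid a zero scale
    scale = amax / INT8_MAX
    Q = [[max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in row] for row in X]
    return Q, scale

def int_gemm(QA, QB):
    """Integer GEMM: int8 products summed in a wide accumulator
    (int32 on hardware; Python ints are unbounded, so overflow is not modeled)."""
    k, n = len(QB), len(QB[0])
    return [[sum(QA[i][p] * QB[p][j] for p in range(k)) for j in range(n)]
            for i in range(len(QA))]

def quantized_matmul(A, B):
    """Approximate A @ B: quantize both operands, multiply in integers, dequantize once."""
    QA, sa = quantize(A)
    QB, sb = quantize(B)
    acc = int_gemm(QA, QB)
    return [[v * sa * sb for v in row] for row in acc]

# Accumulator-width arithmetic.
MAX_PROD = 128 * 128                 # largest |int8 * int8| product: (-128) * (-128)
INT16_MAX, INT32_MAX = 2**15 - 1, 2**31 - 1
assert 2 * MAX_PROD > INT16_MAX      # two worst-case products overflow int16
MAX_K_INT32 = INT32_MAX // MAX_PROD  # worst-case terms that fit in int32
```

All of the inner-loop work happens on small integers, and floating point appears only in the single dequantization step per output element, which is where integer GEMM can save bandwidth and compute.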

Abstract

Recent advances in deep learning (DL) have led to a shift from traditional 64-bit floating-point (FP64) computations toward reduced-precision formats, such as FP16, BF16, and 8- or 16-bit integers, combined with mixed-precision arithmetic. This transition increases computational throughput, reduces memory and bandwidth usage, and improves energy efficiency, offering significant advantages for resource-constrained edge devices. To support this shift, hardware architectures have evolved accordingly and now include adapted instruction set architectures (ISAs) that expose mixed-precision vector units and matrix engines tailored to DL workloads. At the heart of many DL and scientific computing tasks is the general matrix-matrix multiplication (GEMM), a fundamental kernel historically optimized using axpy vector instructions on SIMD (single instruction, multiple data) units. However, as hardware…
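The abstract notes that GEMM was historically optimized with axpy instructions on SIMD units, and it is truncated before describing what newer ISAs offer instead. As an illustration only, the sketch below emulates, lane by lane, an axpy-style update next to a 4-way integer dot-product update in the style of AVX-512 VNNI or Arm's SDOT instruction. The helper names are hypothetical, no real intrinsics are called, and real instructions differ in details such as operand signedness (VNNI's VPDPBUSD multiplies unsigned by signed bytes), which this emulation ignores.

```python
# Lane-by-lane emulation of two ways to update a column of int32 accumulators.
# Helper names are hypothetical; no real intrinsics are called here.

def axpy_update(c, a_col, b_scalar):
    """SIMD axpy: c[i] += a_col[i] * b for every lane i (b is broadcast)."""
    return [ci + ai * b_scalar for ci, ai in zip(c, a_col)]

def dot4_update(c, a_quads, b_quad):
    """4-way dot-product instruction: lane i adds dot(a_quads[i], b_quad),
    i.e. four int8 products summed into one 32-bit accumulator lane."""
    return [ci + sum(x * y for x, y in zip(quad, b_quad))
            for ci, quad in zip(c, a_quads)]

def column_via_axpy(A, b):
    """c = A @ b using one axpy per k-index (k update steps)."""
    c = [0] * len(A)
    for p in range(len(b)):
        c = axpy_update(c, [row[p] for row in A], b[p])
    return c

def column_via_dot4(A, b):
    """c = A @ b consuming four k-indices per step (k/4 update steps).
    This sketch requires len(b) to be a multiple of 4; a real kernel
    would typically zero-pad k during packing."""
    assert len(b) % 4 == 0, "k must be a multiple of 4 in this sketch"
    c = [0] * len(A)
    for p in range(0, len(b), 4):
        c = dot4_update(c, [row[p:p + 4] for row in A], b[p:p + 4])
    return c
```

Both routines compute the same column of C; the difference is that each dot-product step consumes four k-indices and reduces them inside the lane, so the k loop issues a quarter as many update instructions for the same work.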

