# Supporting mixed-datatype matrix multiplication within the BLIS framework

**Authors:** Field G. Van Zee, Devangi N. Parikh, Robert A. van de Geijn

arXiv: 1901.06015 · 2019-05-03

## TL;DR

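This paper extends the GEMM operation of the BLIS framework to support mixed-datatype matrix multiplication, where A, B, and C may each differ in domain (real or complex) and precision (single or double). Performance is mostly preserved on the two many-core systems tested, a Marvell ThunderX2 and an Intel Xeon Platinum, with modest slowdowns from typecasting.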

## Contribution

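It introduces a systematic approach for supporting every combination of real and complex, single- and double-precision operands in BLIS's GEMM, including a computation precision that may differ from the storage precision of A or B. Mixing domains is treated separately from mixing precisions: domain combinations are handled case by case, and precision combinations are handled by typecasting during packing and/or accumulation.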

## Key findings

- High performance mostly preserved on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum
- Modest slowdowns incurred from unavoidable typecast instructions
- 128 datatype combinations implemented with only two microkernels
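
The count of 128 can be reproduced with simple arithmetic, assuming it arises as four storage datatypes for each of A, B, and C, times two computation precisions: 4 × 4 × 4 × 2 = 128. The sketch below enumerates those cases. The names `STORAGE_DATATYPES`, `COMPUTATION_PRECISIONS`, and `enumerate_combinations` are hypothetical and used only for illustration; they are not BLIS identifiers.

```python
from itertools import product

# BLAS-style type characters: s/d = single/double real, c/z = single/double complex.
STORAGE_DATATYPES = ("s", "d", "c", "z")

# Assumption: the computation precision is a two-way choice.
COMPUTATION_PRECISIONS = ("single", "double")


def enumerate_combinations():
    """Return every (A storage, B storage, C storage, computation precision) case."""
    return list(
        product(
            STORAGE_DATATYPES,       # datatype of A
            STORAGE_DATATYPES,       # datatype of B
            STORAGE_DATATYPES,       # datatype of C
            COMPUTATION_PRECISIONS,  # precision in which the product is computed
        )
    )


if __name__ == "__main__":
    print(len(enumerate_combinations()))
```

Writing a dedicated kernel for each of these cases is the combinatorial growth the paper reports avoiding.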

## Abstract

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (GEMM) operation of the BLIS framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the computation is allowed to take place in a precision different from the storage precisions of either A or B, is also included in the discussion. We first break the problem into mostly orthogonal dimensions, considering the mixing of domains separately from mixing precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation (during packing and/or accumulation, as needed). Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatoric intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
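
As a rough illustration of typecasting during packing and accumulation, the sketch below multiplies a single-precision A by a double-precision B, computes in double precision, and accumulates into a single-precision C. It is a toy model under stated assumptions, not BLIS code: `pack_with_cast` and `gemm_mixed_precision` are hypothetical names, matrices are flat row-major buffers from Python's `array` module (`'f'` for single, `'d'` for double), and the triple loop stands in for the optimized microkernel.

```python
from array import array


def pack_with_cast(values, typecode):
    """Copy a buffer into a new one of the given precision (the cast happens here)."""
    return array(typecode, values)


def gemm_mixed_precision(a, b, c, m, n, k):
    """Compute C += A * B on flat row-major buffers, computing in double precision.

    a is m x k, b is k x n, c is m x n. Each may be an array of typecode 'f' or 'd'.
    """
    # Typecast during packing: both inputs are brought to the computation precision.
    a_packed = pack_with_cast(a, "d")
    b_packed = pack_with_cast(b, "d")

    for i in range(m):
        for j in range(n):
            acc = 0.0  # Python floats are doubles, so this accumulates in double
            for p in range(k):
                acc += a_packed[i * k + p] * b_packed[p * n + j]
            # Typecast during accumulation: the store converts to C's own precision.
            c[i * n + j] = c[i * n + j] + acc
    return c


if __name__ == "__main__":
    a = array("f", [1.0, 2.0, 3.0, 4.0])  # 2 x 2, stored in single precision
    b = array("d", [5.0, 6.0, 7.0, 8.0])  # 2 x 2, stored in double precision
    c = array("f", [0.0] * 4)             # 2 x 2, stored in single precision
    print(list(gemm_mixed_precision(a, b, c, 2, 2, 2)))
```

Because the casts are confined to the packing and store steps, the inner computation sees operands of one precision only, which is the property that lets a small number of kernels serve many storage combinations.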

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1901.06015/full.md

## Figures

34 figures with captions in the complete paper: https://tomesphere.com/paper/1901.06015/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/1901.06015/full.md

---
Source: https://tomesphere.com/paper/1901.06015