# Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

**Authors:** Greg Henry, Ping Tak Peter Tang, Alexander Heinecke

arXiv: 1904.06376 · 2019-04-16

## TL;DR

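This paper examines how FMA units that multiply BF16 (16-bit floating-point) inputs and accumulate in FP32 can be used to implement higher-precision matrix routines. Decomposing each FP32 number into several BF16 numbers gives inner products and matrix-matrix products with accuracy comparable to standard FP32 at a projected speed-up of up to 5.2x. Applied to linear systems with iterative refinement, the approach yields solutions comparable to FP64.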

## Contribution

It introduces a method that decomposes FP32 numbers into several BF16 numbers and combines their products through the FMA unit's FP32 accumulation, so that higher-precision matrix operations can run on BF16 hardware while preserving accuracy comparable to standard FP32.
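The decomposition idea can be illustrated with a small emulation in plain Python. This is a minimal sketch, not the authors' implementation: the helper names (`to_f32`, `to_bf16`, `split3`, `dot_bf16x3`) are hypothetical, the choice of three BF16 pieces and of which cross terms to keep is an assumption made for illustration, and NaN, infinity, and underflow are not handled.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round to BF16 (8 significant bits, FP32 exponent range).

    BF16 is the upper 16 bits of the FP32 encoding; the lower 16 bits
    are rounded away with round-to-nearest-even.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round-to-nearest-even bias
    bits &= 0xFFFF0000                    # keep only the BF16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def split3(x: float):
    """Split an FP32 value into three BF16 pieces b0 + b1 + b2 (assumed count)."""
    x = to_f32(x)
    b0 = to_bf16(x)
    r1 = to_f32(x - b0)      # residual after the leading piece
    b1 = to_bf16(r1)
    r2 = to_f32(r1 - b1)     # residual after the second piece
    b2 = to_bf16(r2)
    return b0, b1, b2

def dot_bf16x3(xs, ys) -> float:
    """Inner product from BF16 pieces with emulated FP32 accumulation.

    Assumption: keep the six cross terms with i + j <= 2 and add the
    smallest terms first; other choices trade speed against accuracy.
    """
    X = [split3(x) for x in xs]
    Y = [split3(y) for y in ys]
    acc = 0.0
    for i, j in [(2, 0), (1, 1), (0, 2), (1, 0), (0, 1), (0, 0)]:
        for a, b in zip(X, Y):
            # A BF16 x BF16 product fits exactly in FP32; the sum is rounded to FP32.
            acc = to_f32(acc + a[i] * b[j])
    return acc

if __name__ == "__main__":
    xs = [0.1 * (k + 1) for k in range(8)]
    ys = [1.0 / (k + 3) for k in range(8)]
    print(dot_bf16x3(xs, ys), sum(x * y for x, y in zip(xs, ys)))
```

Each piece carries 8 significant bits, so three pieces cover the 24-bit FP32 significand. Dropping the lowest-order cross terms reduces the number of BF16 multiplications per element, which is where a speed-up over native FP32 could come from on hardware with fast BF16 FMA units.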

## Key findings

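- Projects up to 5.2x speed-up for inner products and matrix-matrix products computed from BF16 decompositions of FP32 data.
- Preserves accuracy comparable to standard FP32 for these products while accommodating the dynamic range.
- With iterative refinement on the residual form, obtains linear system solutions comparable to FP64 across a large range of condition numbers.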

## Abstract

In recent years fused-multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks due to their extreme computational intensity. Compared to classical IEEE-754 32-bit (FP32) and 64-bit (FP64) arithmetic, this reduced-precision arithmetic can naturally be sped up disproportionately to its shortened width. The common strategy of all major hardware vendors is to aggressively further enhance their performance disproportionately. One particular FMA operation that multiplies two BF16 numbers while accumulating in FP32 has been found useful in deep learning, where BF16 is the 16-bit floating point datatype with IEEE FP32 numerical range but 8 significant bits of precision. In this paper, we examine the use of this FMA unit to implement higher-precision matrix routines in terms of potential performance gain and implications on accuracy. We demonstrate how a decomposition into multiple smaller datatypes can be used to assemble a high-precision result, leveraging the higher-precision accumulation of the FMA unit. We first demonstrate that computations of vector inner products and, by natural extension, matrix-matrix products can be achieved by decomposing FP32 numbers into several BF16 numbers followed by appropriate computations that can accommodate the dynamic range and preserve accuracy compared to standard FP32 computations, while projecting up to 5.2x speed-up. Furthermore, we examine the solution of linear equations formulated in the residual form that allows for iterative refinement. We demonstrate that the solutions obtained are comparable to those offered by FP64 under a large range of linear system condition numbers.
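The residual formulation mentioned in the abstract can be sketched as classical iterative refinement: factor and solve in low precision, compute the residual in FP64, and correct. The code below is an illustrative sketch under stated assumptions, not the paper's algorithm. The function names (`f32`, `lu_factor_f32`, `lu_solve_f32`, `refine`) are hypothetical, and emulated FP32 arithmetic stands in for whatever reduced-precision solver is actually used.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def lu_factor_f32(A):
    """LU factorization with partial pivoting, each operation rounded to FP32."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(LU[r][k]))  # pivot row
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])            # multiplier, stored in L
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU, perm

def lu_solve_f32(LU, perm, b):
    """Solve with the FP32 factors: forward then backward substitution."""
    n = len(LU)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):                       # L y = P b (unit diagonal)
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    for i in reversed(range(n)):             # U x = y
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
        y[i] = f32(y[i] / LU[i][i])
    return y

def refine(A, b, iters=5):
    """Iterative refinement: low-precision solves, FP64 residuals and updates."""
    n = len(A)
    LU, perm = lu_factor_f32(A)
    x = lu_solve_f32(LU, perm, b)            # initial low-precision solution
    for _ in range(iters):
        # Residual r = b - A x, computed in FP64.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = lu_solve_f32(LU, perm, r)        # correction from the cheap factors
        x = [x[i] + d[i] for i in range(n)]  # update kept in FP64
    return x

if __name__ == "__main__":
    A = [[4.1, 1.0, 0.3], [1.0, 3.7, 1.0], [0.3, 1.0, 2.9]]
    x_true = [1.0 / 3.0, -2.0 / 7.0, 0.9]
    b = [sum(A[i][j] * x_true[j] for j in range(3)) for i in range(3)]
    print(refine(A, b))
```

The expensive factorization is done once in low precision and reused for every correction step. Only the residual and the solution update need the higher precision, which is why this formulation suits hardware whose fast path is reduced-precision arithmetic. Convergence is not guaranteed for very ill-conditioned systems.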

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.06376/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1904.06376/full.md

## References

14 references — full list in the complete paper: https://tomesphere.com/paper/1904.06376/full.md

---
Source: https://tomesphere.com/paper/1904.06376