High Accuracy Low Precision QR Factorization and Least Square Solver on GPU with TensorCore

Shaoshuai Zhang; Panruo Wu

arXiv:1912.05508·cs.MS·December 12, 2019·1 cites

TL;DR

This paper introduces a high-accuracy, low-precision QR factorization and linear least squares solver for GPUs with TensorCore units, reporting speedups of up to 14x over single precision cuSOLVER QR factorization and up to 10x over a double precision direct QR least squares solver.

Contribution

It presents a novel mixed precision algorithm and implementation that leverages TensorCore GPU units for efficient large-scale linear least squares solving.

Findings

01 Up to 14x faster than single precision cuSOLVER for QR factorization at large scale, with slightly lower accuracy.

02 Up to 10x faster than a double precision direct QR least squares solver, with comparable accuracy.

03 Achieves high accuracy with low precision TensorCore computations.

Abstract

Driven by the insatiable need to process ever larger amounts of data with more complex models, modern computer processors and accelerators are beginning to offer half precision floating point arithmetic support, along with extremely optimized special units such as NVIDIA TensorCore on GPU and Google Tensor Processing Unit (TPU) that perform half precision matrix-matrix multiplication exceptionally efficiently. In this paper we present a large scale mixed precision linear least squares solver that achieves high accuracy using the low precision TensorCore GPU. The mixed precision system consists of both innovative algorithms and implementations, and is shown to be up to 14x faster than single precision cuSOLVER at QR matrix factorization at large scale with slightly lower accuracy, and up to 10x faster than a double precision direct QR least squares solver with comparable accuracy.
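The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common way to recover high accuracy from a low precision factorization: compute the triangular factor R in reduced precision, then refine the solution in double precision using R as a preconditioner (corrected semi-normal equations). It is not the authors' method or implementation. Rounding to float16 on the CPU stands in for TensorCore input precision, and the names `mixed_precision_lstsq` and `iters` are hypothetical.

```python
import numpy as np


def mixed_precision_lstsq(A, b, iters=10):
    """Illustrative sketch: low precision QR plus float64 refinement.

    Assumption: A has full column rank and is well conditioned enough
    that the low precision R is a good approximation of the true factor.
    """
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)

    # Emulate low precision input: round A to float16, then factor in
    # float32 (NumPy's linalg routines do not accept float16 directly).
    A_low = A.astype(np.float16).astype(np.float32)
    R = np.linalg.qr(A_low, mode="r").astype(np.float64)

    x = np.zeros(A.shape[1])
    for _ in range(iters):
        r = b - A @ x  # residual in float64
        g = A.T @ r    # right-hand side of the normal equations for the correction
        # Solve R^T R dx = g. np.linalg.solve is a general solver; a
        # dedicated triangular solver would be the natural choice here.
        y = np.linalg.solve(R.T, g)
        dx = np.linalg.solve(R, y)
        x += dx
    return x
```

Each pass reduces the error by roughly the relative inaccuracy of R, so a well conditioned problem needs only a few iterations to reach double precision accuracy, while the expensive factorization itself stays in low precision.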

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Matrix Theory and Algorithms