A Heterogeneous Accelerated Matrix Multiplication: OpenCL + APU + GPU + Fast Matrix Multiply

Paolo D'Alberto

arXiv:1205.2927·cs.MS·May 15, 2012

TL;DR

This paper presents a case study of dense matrix multiplication on a heterogeneous system, using OpenCL with an APU (multiple CPU cores plus an integrated GPU) and external GPUs. On a home-built machine it reports an achievable peak of 200 GFLOPS for matrix sizes up to 19K × 19K.

Contribution

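It demonstrates a practical implementation of dense matrix multiplication that uses every computational engine in the system (APU cores, integrated GPU, and external GPUs), and it reports the achieved performance along with the quirks and pitfalls encountered in a hybrid hardware environment.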

Findings

01

Achieved up to 200 GFLOPS performance.

02

Used the APU and external GPUs together for matrix sizes up to 19K × 19K.

03

Identified key quirks and pitfalls in hybrid matrix multiplication.

Abstract

As users and developers, we are witnessing the opening of a new computing scenario: the introduction of hybrid processors on a single die, such as an accelerated processing unit (APU), and the plug-and-play addition of graphics processing units (GPUs) to a single motherboard. These APUs provide multiple symmetric cores with their memory hierarchies and an integrated GPU. Moreover, these processors are designed to work with external GPUs that can push the peak performance towards the TeraFLOPS boundary. We present a case study for the development of dense Matrix Multiplication (MM) codes for matrix sizes up to 19K × 19K, thus using all of the above computational engines, and an achievable peak performance of 200 GFLOPS for, literally, a made-at-home build. We present the results of our experience, the quirks, the pitfalls, the achieved performance, and the…
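To put the abstract's numbers in perspective, the following is a minimal back-of-the-envelope sketch, not the paper's code. It assumes that "19K" means 19,000, that the classical algorithm is counted as 2n³ floating-point operations, and that work is split by rows in proportion to each device's throughput. The device names and rates in the usage lines are hypothetical placeholders, not measurements from the paper.

```python
def mm_flops(n: int) -> int:
    """Conventional operation count for a classical n x n matrix multiply: 2 * n^3."""
    return 2 * n ** 3


def seconds_at(n: int, gflops: float) -> float:
    """Time to multiply two n x n matrices at a sustained rate given in GFLOPS."""
    return mm_flops(n) / (gflops * 1e9)


def split_rows(n: int, rates: dict) -> dict:
    """Assign rows of the result matrix to devices in proportion to their rates.

    `rates` maps a (hypothetical) device name to a relative throughput.
    The last device receives the remainder so that every row is assigned.
    """
    total = sum(rates.values())
    names = list(rates)
    shares = {}
    assigned = 0
    for name in names[:-1]:
        rows = int(n * rates[name] / total)
        shares[name] = rows
        assigned += rows
    shares[names[-1]] = n - assigned
    return shares


if __name__ == "__main__":
    n = 19_000  # assumption: "19K" read as 19,000
    print("operations:", mm_flops(n))
    print("seconds at 200 GFLOPS:", seconds_at(n, 200.0))
    # Hypothetical relative throughputs, for illustration only.
    print("row split:", split_rows(n, {"cpu": 20, "igpu": 60, "dgpu": 120}))
```

Under these assumptions, a single 19,000 × 19,000 multiplication is about 1.4 × 10¹³ operations, or a little over a minute of work at the reported 200 GFLOPS. A proportional split is only one possible scheduling policy; the source excerpt does not say how the paper divides work among the engines.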

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Interconnection Networks and Systems