Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

Sandra Catalán; Francisco D. Igual; Rafael Mayo; Rafael Rodríguez-Sánchez; Enrique S. Quintana-Ortí

arXiv:1506.08988·cs.PF·July 1, 2015

Open Access

TL;DR

This paper presents architecture-aware optimizations for matrix multiplication on ARM big.LITTLE asymmetric multicore processors, improving performance and energy efficiency through cache-aware configuration and asymmetric scheduling strategies.

Contribution

It introduces a high-performance, energy-efficient GEMM implementation tailored for ARM big.LITTLE architectures using cache-aware and asymmetric scheduling techniques.

Findings

1. Significant performance improvements over architecture-oblivious implementations.

2. Enhanced energy efficiency on ARM big.LITTLE processors.

3. Effective utilization of heterogeneous cores for scientific computing.

Abstract

Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high-performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high-performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric-static and dynamic scheduling strategies that carefully tune and…
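To make the idea of asymmetric static scheduling concrete, the following is a minimal sketch of splitting the iterations of one loop between a big and a LITTLE cluster in proportion to their assumed speeds. It is not the paper's or BLIS's implementation: the function names, the `big_to_little_ratio` parameter, and the sample ratio of 3 used in the usage note are all hypothetical and chosen only for illustration.

```python
def even_ranges(start, count, parts):
    """Split `count` consecutive iterations, beginning at `start`, into
    `parts` contiguous half-open ranges whose sizes differ by at most one."""
    if parts == 0:
        return []
    base, extra = divmod(count, parts)
    ranges = []
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def static_asymmetric_split(n_iters, n_big, n_little, big_to_little_ratio):
    """Statically partition `n_iters` loop iterations between two clusters.

    Assumption (hypothetical, for illustration): one big core processes
    iterations `big_to_little_ratio` times as fast as one LITTLE core.
    Returns (big_ranges, little_ranges), one (start, end) range per core.
    """
    total_weight = n_big * big_to_little_ratio + n_little
    if total_weight <= 0:
        raise ValueError("at least one core with positive weight is required")
    # Share of the work assigned to the big cluster, proportional to its weight.
    big_total = round(n_iters * n_big * big_to_little_ratio / total_weight)
    little_total = n_iters - big_total
    # Within each cluster the cores are identical, so split evenly.
    big_ranges = even_ranges(0, big_total, n_big)
    little_ranges = even_ranges(big_total, little_total, n_little)
    return big_ranges, little_ranges
```

For example, `static_asymmetric_split(100, 4, 4, 3.0)` assigns 75 iterations to the four big cores and 25 to the four LITTLE cores. A static split like this is only as good as the assumed ratio, which is one reason a dynamic strategy, where cores take further work as they finish, may be preferable when the relative speeds are not known in advance.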

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Advanced Data Storage Technologies