Fast hardware-aware matrix-free algorithm for higher-order finite-element discretized matrix multivector products on distributed systems
Gourab Panigrahi, Nikhil Kodali, Debashis Panda, Phani Motamarri

TL;DR
This paper introduces efficient hardware-aware matrix-free algorithms for higher-order finite-element matrix-multivector products on distributed systems, significantly improving performance over traditional methods on CPUs and GPUs.
Contribution
It develops batched, hardware-optimized algorithms for matrix-multivector products, extending existing matrix-free methods to handle multiple vectors simultaneously on distributed architectures.
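To make the idea of a matrix-free matrix-multivector product concrete, here is a minimal sketch. It is not the authors' implementation: it assumes a 1D mesh of quadratic Lagrange cells for a Laplacian, and the names `local_stiffness_p2`, `cell_dofs`, and `matrix_free_apply` are hypothetical. The global matrix is never assembled; each cell applies its small local matrix to all vectors in the batch at once.

```python
def local_stiffness_p2(h):
    """Stiffness matrix of one 1D quadratic Lagrange cell of length h.

    Node order is (left, middle, right). Illustrative model problem only.
    """
    c = 1.0 / (3.0 * h)
    return [[7 * c, -8 * c, 1 * c],
            [-8 * c, 16 * c, -8 * c],
            [1 * c, -8 * c, 7 * c]]


def cell_dofs(cell):
    """Global degree-of-freedom indices of a cell; neighbours share an end node."""
    return [2 * cell, 2 * cell + 1, 2 * cell + 2]


def matrix_free_apply(n_cells, h, X):
    """Compute Y = A X without assembling A.

    X[dof][k] holds entry `dof` of vector k, so each row is one batch of values.
    """
    n_dofs = 2 * n_cells + 1
    batch = len(X[0])
    K = local_stiffness_p2(h)
    Y = [[0.0] * batch for _ in range(n_dofs)]
    for cell in range(n_cells):
        dofs = cell_dofs(cell)
        x_loc = [X[d] for d in dofs]          # gather cell-local rows
        for i, di in enumerate(dofs):
            y_row = Y[di]                      # scatter-add target row
            for j in range(3):
                kij = K[i][j]
                xj = x_loc[j]
                # Innermost loop runs over the batch, so K[i][j] is
                # loaded once and reused for every vector.
                for k in range(batch):
                    y_row[k] += kij * xj[k]
    return Y
```

Storing the batch index innermost is one possible layout choice: it keeps the values of all vectors at a given node contiguous, which is the kind of data locality that batching is meant to exploit.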
Findings
Up to 2.8x speedup on GPU nodes for matrix-multivector products.
Up to 4.4x performance improvement on multi-node CPU systems.
Up to 3.0x speedup for eigenvalue problem solves on distributed systems.
Abstract
Recent hardware-aware matrix-free algorithms for higher-order finite-element (FE) discretized matrix-vector multiplications reduce floating-point operations and data access costs compared to traditional sparse matrix approaches. This work proposes efficient matrix-free algorithms for evaluating FE discretized matrix-multivector products on both multi-node CPU and GPU architectures. We address a critical gap in existing matrix-free implementations, which are well suited only for the action of FE discretized matrices on a single vector. We employ batched evaluation strategies, with the batch size tailored to the underlying hardware architecture, leading to better data locality and enabling further parallelization. On CPUs, we utilize even-odd decomposition, SIMD vectorization, and overlapping computation and communication strategies. On GPUs, we employ strategies to overlap compute and data…
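The even-odd decomposition mentioned in the abstract can be illustrated with a generic sketch. This is an assumption-laden example, not the paper's code: it assumes a square matrix `S` of even size `n` that is centrosymmetric, meaning `S[i][j] == S[n-1-i][n-1-j]`, a property that 1D interpolation matrices on symmetric node sets typically have. The function names are hypothetical.

```python
def even_odd_split(S):
    """Precompute the two half-size blocks of a centrosymmetric matrix S.

    Assumes S is n x n with n even and S[i][j] == S[n-1-i][n-1-j].
    """
    n = len(S)
    m = n // 2
    Se = [[S[i][j] + S[i][n - 1 - j] for j in range(m)] for i in range(m)]
    So = [[S[i][j] - S[i][n - 1 - j] for j in range(m)] for i in range(m)]
    return Se, So


def even_odd_matvec(Se, So, u):
    """Compute y = S u using the precomputed blocks Se and So."""
    n = len(u)
    m = n // 2
    # Split the input into its symmetric (even) and antisymmetric (odd) parts.
    e = [0.5 * (u[j] + u[n - 1 - j]) for j in range(m)]
    o = [0.5 * (u[j] - u[n - 1 - j]) for j in range(m)]
    y = [0.0] * n
    for i in range(m):
        ye = sum(Se[i][j] * e[j] for j in range(m))
        yo = sum(So[i][j] * o[j] for j in range(m))
        y[i] = ye + yo              # first half of the result
        y[n - 1 - i] = ye - yo      # mirrored half reuses the same sums
    return y
```

The two half-size products cost about `n*n/2` multiplications in total instead of `n*n` for the dense product, which is where the saving comes from.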
Taxonomy
Topics: Matrix Theory and Algorithms · Parallel Computing and Optimization Techniques · Tensor decomposition and applications
