Learning-Augmented Performance Model for Tensor Product Factorization in High-Order FEM

Xuanzhengbo Ren; Yuta Kawai; Tetsuya Hoshino; Hirofumi Tomita; Takahiro Katagiri; Daichi Mukunoki; Seiya Nishizawa

arXiv:2601.06886 · cs.DC · March 24, 2026

TL;DR

This paper introduces a learning-augmented performance model for tensor product kernels in high-order FEM, improving accuracy over traditional models by capturing instruction-level effects on modern HPC architectures.

Contribution

The paper develops a dependency-chain-based analytical model combined with machine learning to accurately predict kernel performance on diverse HPC architectures.

Findings

01

Outperforms Roofline and ECM models in accuracy.

02

Achieves low MAPE (1-24%) on Fujitsu A64FX.

03

Achieves low MAPE (1-24%) on Intel Xeon Gold 6230.
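
For readers unfamiliar with the metric, MAPE (mean absolute percentage error) averages the relative gap between measured and predicted values and reports it in percent. The snippet below is a minimal sketch of that standard definition, not the paper's evaluation code. The function name `mape` and the sample runtimes are hypothetical and chosen only for illustration.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent.

    MAPE = 100 / N * sum(|measured_i - predicted_i| / |measured_i|)
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for m, p in zip(measured, predicted):
        if m == 0:
            # The relative error is undefined for a zero reference value.
            raise ValueError("measured values must be non-zero")
        total += abs((m - p) / m)
    return 100.0 * total / len(measured)


# Hypothetical kernel runtimes in milliseconds (illustration only,
# not data from the paper).
measured_ms = [10.0, 20.0, 40.0]
predicted_ms = [11.0, 18.0, 40.0]
error_percent = mape(measured_ms, predicted_ms)
```

Because each term is normalized by the measured value, a fixed absolute error weighs more heavily on short-running kernels than on long-running ones.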

Abstract

Accurate performance prediction is essential for optimizing scientific applications on modern high-performance computing (HPC) architectures. Widely used performance models primarily focus on cache and memory bandwidth, which is suitable for many memory-bound workloads. However, such models are unsuitable for highly arithmetic-intensive cases such as sum-factorization with tensor $n$-mode product kernels, an optimization technique for high-order finite element methods (FEM). On processors with relatively high single instruction multiple data (SIMD) instruction latency, such as the Fujitsu A64FX, the performance of these kernels is strongly influenced by loop-body splitting strategies. Memory-bandwidth-oriented models are therefore not appropriate for evaluating these splitting configurations, and a model that directly reflects instruction-level efficiency is required. To address…
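
To make the kernel in question concrete, the following is a minimal pure-Python sketch of sum-factorization expressed as three successive tensor $n$-mode products on a 3-D tensor, checked against the direct (unfactorized) triple sum. It is an illustration of the general technique only, not the paper's kernel, and it says nothing about SIMD or loop-body splitting. All names (`mode_product`, `sum_factorized`, `naive_apply`) and the sample data are hypothetical.

```python
def mode_product(tensor, matrix, mode):
    """n-mode product of a 3-D nested-list tensor with a 2-D matrix.

    Contracts axis `mode` of `tensor` with the second index of `matrix`,
    so `matrix` must have shape (rows, tensor_shape[mode]).
    """
    shape = [len(tensor), len(tensor[0]), len(tensor[0][0])]
    old = shape[mode]
    shape[mode] = len(matrix)
    out = [[[0.0] * shape[2] for _ in range(shape[1])]
           for _ in range(shape[0])]
    for i in range(shape[0]):
        for j in range(shape[1]):
            for k in range(shape[2]):
                idx = [i, j, k]
                r = idx[mode]          # output index along the contracted axis
                acc = 0.0
                for s in range(old):   # 1-D contraction along `mode`
                    idx[mode] = s
                    acc += matrix[r][s] * tensor[idx[0]][idx[1]][idx[2]]
                out[i][j][k] = acc
    return out


def sum_factorized(u, a):
    """Apply the same 1-D operator `a` along each of the three modes."""
    return mode_product(mode_product(mode_product(u, a, 0), a, 1), a, 2)


def naive_apply(u, a):
    """Direct evaluation of the same operator as one six-deep loop nest."""
    q, n = len(a), len(u)
    out = [[[0.0] * q for _ in range(q)] for _ in range(q)]
    for x in range(q):
        for y in range(q):
            for z in range(q):
                acc = 0.0
                for i in range(n):
                    for j in range(n):
                        for k in range(n):
                            acc += a[x][i] * a[y][j] * a[z][k] * u[i][j][k]
                out[x][y][z] = acc
    return out


# Hypothetical 2x2 operator and 2x2x2 coefficient tensor.
a = [[1.0, 2.0], [3.0, 4.0]]
u = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
factored = sum_factorized(u, a)
direct = naive_apply(u, a)
```

With n points per dimension, the factorized form performs three passes of roughly n^4 multiply-adds each, while the direct form needs about n^6. That reduction is why the factorized kernels are dominated by short, dense 1-D contractions whose instruction-level behavior matters.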

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Low-power high-performance VLSI design