
TL;DR
This paper describes the implementation and optimization of a Wilson-Dirac solver with Clover term on QPACE, a parallel machine based on Cell processors and a torus network, and reports an aggregate sustained performance of about 10 TFlops on up to 256 nodes with very good scaling.
Contribution
It adopts a mixed-precision Schwarz-preconditioned FGCR algorithm for QPACE in order to circumvent network bandwidth and latency constraints, make efficient use of multicore parallelism and on-chip memory, and keep the choice of lattice sizes flexible.
Findings
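Aggregate sustained performance of about 10 TFlops for the complete solver on up to 256 QPACE nodes
Very good scaling of the solver in the reported benchmarks
Efficient use of the Cell processor's multicore parallelism and on-chip memory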
Abstract
We discuss the implementation and optimization challenges for a Wilson-Dirac solver with Clover term on QPACE, a parallel machine based on Cell processors and a torus network. We choose the mixed-precision Schwarz preconditioned FGCR algorithm in order to circumvent network bandwidth and latency constraints, to make efficient use of the multicore parallelism and on-chip memory, and to achieve flexibility in the choice of lattice sizes. We present benchmarks on up to 256 QPACE nodes showing an aggregate sustained performance of about 10 TFlops for the complete solver and very good scaling.
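To make the algorithm named in the abstract more concrete, the following is a minimal sketch of a flexible GCR (FGCR) iteration with a low-precision, domain-wise preconditioner. It is not the authors' implementation. Everything in it is an assumption chosen for illustration: the small real nonsymmetric test matrix standing in for the Wilson-Dirac operator, the non-overlapping 2x2 block solve standing in for the Schwarz preconditioner, the single-precision rounding via `struct` standing in for mixed-precision arithmetic, and all function names (`fgcr`, `block_precond`, `to_single`).

```python
import struct

def to_single(v):
    # Emulate single precision by round-tripping each entry through float32.
    return [struct.unpack("f", struct.pack("f", x))[0] for x in v]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def block_precond(A, r):
    # Hypothetical stand-in for a Schwarz preconditioner: solve the 2x2
    # diagonal blocks of A independently (no coupling between domains),
    # with input and output rounded to single precision.
    r = to_single(r)
    z = [0.0] * len(r)
    for s in range(0, len(r), 2):
        a, b = A[s][s], A[s][s + 1]
        c, d = A[s + 1][s], A[s + 1][s + 1]
        det = a * d - b * c
        z[s] = (d * r[s] - b * r[s + 1]) / det
        z[s + 1] = (a * r[s + 1] - c * r[s]) / det
    return to_single(z)

def fgcr(A, b, precond, tol=1e-10, max_iter=50):
    # Flexible GCR: the preconditioner may change from step to step, so both
    # the search directions z_k and their images q_k = A z_k are stored.
    x = [0.0] * len(b)
    r = list(b)
    zs, qs = [], []
    b_norm = dot(b, b) ** 0.5
    for it in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = precond(A, r)          # low-precision preconditioner
        q = matvec(A, z)           # double-precision operator application
        for zj, qj in zip(zs, qs): # orthogonalize q against earlier q_j
            beta = dot(qj, q)
            q = [qi - beta * qji for qi, qji in zip(q, qj)]
            z = [zi - beta * zji for zi, zji in zip(z, zj)]
        norm = dot(q, q) ** 0.5
        q = [qi / norm for qi in q]
        z = [zi / norm for zi in z]
        alpha = dot(q, r)          # minimizes the residual along q
        x = [xi + alpha * zi for xi, zi in zip(x, z)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        zs.append(z)
        qs.append(q)
    return x, max_iter

def test_matrix(n=8):
    # Small nonsymmetric, diagonally dominant matrix (illustrative only).
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 4.0
        if i + 1 < n:
            A[i][i + 1] = -1.0
            A[i + 1][i] = -0.5
    return A

A = test_matrix()
b = [1.0] * len(A)
x, iters = fgcr(A, b, block_precond)
```

The design point this sketch tries to illustrate is that the preconditioner only touches data local to each block and runs in reduced precision, while the outer iteration keeps the residual in double precision. On a machine like QPACE, that kind of split is what would let most of the work stay in on-chip memory and off the network, although the actual domain decomposition and precision handling in the paper may differ.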
