Cell processor implementation of a MILC lattice QCD application
Guochun Shi (1), Volodymyr Kindratenko (1), Steven Gottlieb (2) ((1) University of Illinois, (2) Indiana University)

TL;DR
This paper reports on implementing a lattice QCD simulation on the Cell processor, highlighting performance bottlenecks due to memory bandwidth and demonstrating significant speedups over traditional CPUs despite limited kernel performance.
Contribution
First implementation of a MILC lattice QCD application on the Cell processor, analyzing performance bottlenecks and demonstrating notable speedups over standard CPUs.
Findings
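Kernels only infrequently exceed 10 GFLOPS (just over 4% of the Cell's peak) because performance is limited by the bandwidth between main memory and the SPEs.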
Achieved up to 9.6x speedup on a single Cell processor.
Many kernels sustain a bandwidth close to 20 GB/s, which is 78% of the Cell's peak.
Abstract
We present results of the implementation of one MILC lattice QCD application (simulation with dynamical clover fermions using the hybrid-molecular dynamics R algorithm) on the Cell Broadband Engine processor. Fifty-four individual computational kernels responsible for 98.8% of the overall execution time were ported to the Cell's Synergistic Processing Elements (SPEs). The remaining application framework, including MPI-based distributed code execution, was left to the Cell's PowerPC processor. We observe that we only infrequently achieve more than 10 GFLOPS with any of the kernels, which is just over 4% of the Cell's peak performance. At the same time, many of the kernels are sustaining a bandwidth close to 20 GB/s, which is 78% of the Cell's peak. This indicates that the application performance is limited by the bandwidth between the main memory and the SPEs. In spite of this limitation,…
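The bandwidth-bound argument in the abstract can be checked with a little arithmetic. The sketch below is illustrative only, not taken from the paper. It uses the figures quoted above (10 GFLOPS, 20 GB/s, 78%) and assumes, for the sake of the estimate, a hypothetical kernel that reaches both the 10 GFLOPS and the 20 GB/s figures at the same time; the abstract does not say that any single kernel does. The function and variable names are made up for this example.

```python
# Figures quoted in the abstract.
KERNEL_GFLOPS = 10.0         # rarely exceeded by any kernel
KERNEL_BANDWIDTH_GBS = 20.0  # sustained by many kernels
BANDWIDTH_FRACTION = 0.78    # share of the Cell's peak bandwidth


def implied_peak(observed: float, fraction: float) -> float:
    """Peak value implied by an observed value and its share of peak."""
    return observed / fraction


def arithmetic_intensity(gflops: float, gbytes_per_s: float) -> float:
    """Floating-point operations performed per byte moved."""
    return gflops / gbytes_per_s


# Peak memory bandwidth implied by "20 GB/s is 78% of peak".
peak_bandwidth = implied_peak(KERNEL_BANDWIDTH_GBS, BANDWIDTH_FRACTION)

# ASSUMPTION: one kernel hits both quoted figures simultaneously.
intensity = arithmetic_intensity(KERNEL_GFLOPS, KERNEL_BANDWIDTH_GBS)

# Ceiling for such a kernel if it could use 100% of the peak bandwidth.
bandwidth_bound_gflops = intensity * peak_bandwidth

print(f"implied peak bandwidth: {peak_bandwidth:.1f} GB/s")
print(f"arithmetic intensity:   {intensity:.2f} flop/byte")
print(f"bandwidth-bound limit:  {bandwidth_bound_gflops:.1f} GFLOPS")
```

Under that assumption, a kernel doing about half a floating-point operation per byte transferred would stay far below the processor's compute peak even with perfect bandwidth utilization, which is consistent with the abstract's conclusion that memory bandwidth, not arithmetic throughput, is the limiting factor.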
