Phantom-GRAPE: numerical software library to accelerate collisionless $N$-body simulation with SIMD instruction set on x86 architecture

Ataru Tanikawa; Kohji Yoshikawa; Keigo Nitadori; Takashi Okamoto

arXiv:1203.4037·astro-ph.IM·June 4, 2015

TL;DR

Phantom-GRAPE is a SIMD-optimized software library that significantly accelerates collisionless N-body force calculations on x86 CPUs, outperforming GPU-based methods especially in large, complex simulations.

Contribution

The paper introduces Phantom-GRAPE, a novel SIMD extension-based library that accelerates force calculations for collisionless N-body simulations, supporting arbitrary force shapes with minimal performance dependence on particle number.

Findings

01

Achieves 2 x 10^9 interactions per second for Newtonian forces on a single core of an Intel Core i7-2600.

02

Single-core performance is 20 times that of an implementation without explicit SIMD instructions and twice that of an SSE implementation.

03

Weak dependence of performance on particle number makes it suitable for large-scale collisionless simulations.

Abstract

(Abridged) We have developed a numerical software library for collisionless N-body simulations named "Phantom-GRAPE", which greatly accelerates force calculations among particles by using AVX, a new SIMD instruction set extension to the x86 architecture and an enhanced version of SSE. Our library quickly computes not only Newtonian forces but also central forces with an arbitrary shape f(r) that has a finite cutoff radius r_cut (i.e. f(r)=0 at r>r_cut). Using an Intel Core i7-2600 processor, we measure the performance of our library for both kinds of force. For Newtonian forces, we achieve 2 x 10^9 interactions per second with 1 processor core, which is 20 times higher than the performance of an implementation without any explicit use of SIMD instructions, and 2 times higher than that with the SSE instructions. With 4 processor cores, we obtain the performance of 8 x 10^9…
