Optimizing Xeon Phi for Interactive Data Analysis

Chansup Byun; Jeremy Kepner; William Arcand; David Bestor; William Bergeron; Matthew Hubbell; Vijay Gadepally; Michael Houle; Michael Jones; Anne Klein; Lauren Milechin; Peter Michaleas; Julie Mullen; Andrew Prout; Antonio Rosa; Siddharth Samsi; Charles Yee; Albert Reuther

arXiv:1907.03195·cs.PF·December 3, 2019

TL;DR

This paper evaluates how to optimize Xeon Phi for interactive data analysis by tuning settings like OpenMP and memory modes, achieving up to 66% of peak performance in matrix operations.

Contribution

It provides detailed performance results and tuning guidelines for Xeon Phi in data analysis environments like Matlab and Octave.

Findings

01

Achieved 66% of practical peak performance in matrix multiplication.

02

Optimal settings include KMP_AFFINITY, taskset pinning, and all2all cache mode.

03

Performance improvements enabled real-world application success.

Full text

Chansup Byun1, Jeremy Kepner1,2,3,

William Arcand1, David Bestor1, William Bergeron1, Matthew Hubbell1, Vijay Gadepally1,2,

Michael Houle1, Michael Jones1, Anne Klein1, Lauren Milechin4, Peter Michaleas1,

Julie Mullen1, Andrew Prout1, Antonio Rosa1, Siddharth Samsi1, Charles Yee1, Albert Reuther1

1MIT Lincoln Laboratory Supercomputing Center, 2MIT Computer Science & AI Laboratory,

3MIT Mathematics Department, 4MIT Department of Earth, Atmospheric and Planetary Sciences

Abstract

The Intel Xeon Phi manycore processor is designed to provide high performance matrix computations of the type often performed in data analysis. Common data analysis environments include Matlab, GNU Octave, Julia, Python, and R. Achieving optimal performance of matrix operations within data analysis environments requires tuning the Xeon Phi OpenMP settings, process pinning, and memory modes. This paper describes matrix multiplication performance results for Matlab and GNU Octave over a variety of combinations of process counts, OpenMP threads, and Xeon Phi memory modes. These results indicate that using KMP_AFFINITY=granularity=fine, taskset pinning, and all2all cache memory mode allows both Matlab and GNU Octave to achieve 66% of the practical peak performance for process counts ranging from 1 to 64 and OpenMP threads ranging from 1 to 64. These settings have resulted in generally improved performance across a range of applications and have enabled our Xeon Phi system to deliver significant results in a number of real-world applications.

I Introduction

This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001 and National Science Foundation grants DMS-1312831 and CCF-1533644. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering or the National Science Foundation.

The Intel Xeon Phi 72x0 (KNL - Knights Landing) processor represents an important contribution in a long line of manycore processors [1, 2, 3, 4] with a high core count ($\geq$ 64), a large number of vector units ($\geq$ 128), a tiled physical layout, and high speed memory combined with significant amounts of DRAM [5, 6] (see Figures 1 and 2). The Xeon Phi is ideally suited to applications that perform many vector operations. Matrix multiplication is a common data analysis operation [7] that is well-suited to the Xeon Phi processor. Mathematically, matrix-matrix multiplication is denoted

$\mathbf{C} = \mathbf{A}\mathbf{B}$

where $\mathbf{A}$ is an N $\times$ L matrix, $\mathbf{B}$ is an L $\times$ M matrix, and $\mathbf{C}$ is an N $\times$ M matrix.

Increasingly, data analysis is performed in high-level programming environments that include Matlab, GNU Octave, Julia, Python, and R. These environments allow a programmer to invoke the full power of a processor such as the Xeon Phi with simple, intuitive syntax

C = A * B

While the above code makes matrix multiplication easy to invoke, there are significant additional tuning and configuration steps necessary to allow such an operation to achieve maximum performance [8, 9, 10, 11, 12, 13, 14]. These steps are often outside the domain of expertise of data analysis programmers and best provided by systems operators. The Lincoln Laboratory Supercomputing Center (LLSC) operates a 648-node Xeon Phi supercomputer. Our focus is on interactive high performance environments so this work explores the steps necessary to allow these environments (Matlab and GNU Octave specifically) to achieve maximum performance on matrix multiplication as invoked by the above Matlab/Octave code syntax.

Our prior work has focused on the interactive launch of thousands of data analysis environments across hundreds of nodes [15, 16, 17, 18, 19]. This paper focuses on the methods we used to get maximum single node Xeon Phi performance, in particular with respect to OpenMP parameters, process pinning, and memory settings. We have found that these settings have resulted in generally improved performance across a range of applications and have enabled our Xeon Phi system to deliver significant results in a number of real-world applications in health sciences [20], hurricane relief [21], astronomy [22], and cybersecurity [23]. The rest of the paper is organized as follows. First, the effective OpenMP parameters for Matlab and GNU Octave are given. Second, the method for pinning processes to cores is presented. Third, the Xeon Phi memory modes are described. Finally, the integrated overall performance measurements are presented for the different memory modes.

II OpenMP

OpenMP (www.openmp.org) is an application programming interface that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran. OpenMP is an important tool used in many math libraries to exploit multiple cores on a shared memory compute node. The maximum parallelism that OpenMP will seek to exploit is often set via the environment variable OMP_NUM_THREADS.

To allow a user to readily control the number of nodes, processes, and OpenMP threads their parallel Matlab/Octave program uses, the LLSC system uses our pMatlab [24] manycore launch infrastructure and its simple interactive parallel launch syntax

[TABLE]

In the above syntax, Nnode is the number of compute nodes that the user desires to run on, Nproc is the number of processes (distinct Matlab/Octave instances) per node, and Nthread sets the value of OMP_NUM_THREADS. In this paper, the focus is on single node performance (Nnode=1), and the number of processes and OpenMP threads used for any given computation will be denoted Nproc $\times$ Nthread. For a 64 core Xeon Phi, the standard configurations will be 1 $\times$ 64, 2 $\times$ 32, 4 $\times$ 16, 8 $\times$ 8, 16 $\times$ 4, 32 $\times$ 2, and 64 $\times$ 1. If an application can take advantage of more OpenMP threads than cores, that can easily be set. For example, 8 $\times$ 32 would have 8 processes each allocating 32 OpenMP threads, nominally consuming 256 logical cores. Likewise, for applications where fewer OpenMP threads are optimal, that can also be specified. For example, 8 $\times$ 2 would have 8 processes each allocating 2 OpenMP threads. In general, the pMatlab manycore syntax makes it very easy to experiment with different combinations of processes and OpenMP threads to find the best performance. GNU Octave uses the OMP_NUM_THREADS environment variable directly. For Matlab, additional code is run automatically in a pMatlab launch to align Matlab with OMP_NUM_THREADS:

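----------------------------
% Read the OpenMP thread count exported by the launch environment.
Nomp = str2num(getenv('OMP_NUM_THREADS'));
% Align Matlab's computational thread limit with OMP_NUM_THREADS.
if (Nomp > 1)
    maxNumCompThreads(Nomp);
end
----------------------------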
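The standard Nproc $\times$ Nthread configurations listed above can be enumerated programmatically. The following Python sketch is illustrative only and is not part of pMatlab; the function name standard_configs is hypothetical, and it assumes power-of-two process counts whose product with the thread count equals the core count.

```python
def standard_configs(ncores=64):
    """Return (Nproc, Nthread) pairs with Nproc * Nthread == ncores.

    Assumption: only power-of-two process counts are considered,
    matching the 1x64 ... 64x1 list given in the text.
    """
    configs = []
    nproc = 1
    while nproc <= ncores:
        if ncores % nproc == 0:
            configs.append((nproc, ncores // nproc))
        nproc *= 2
    return configs


for nproc, nthread in standard_configs(64):
    print(f"{nproc} x {nthread}")
```

Oversubscribed or undersubscribed cases such as 8 $\times$ 32 or 8 $\times$ 2 fall outside this helper and would simply be specified by hand.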

There are a variety of patterns that can be used to map OpenMP threads to processor cores. The KMP_AFFINITY environment variable in the Intel compilers can be used to set these patterns [25]. For nodes that support hyperthreading, the granularity modifier specifies whether to pin OpenMP threads to physical cores (granularity=core) or logical cores (granularity=fine). Using granularity=thread enables distribution of OpenMP threads in a compact or scatter fashion [26]. For this work

KMP_AFFINITY=granularity=fine

was used as it prevented Matlab/Octave from over-allocating OpenMP threads to the same processor core as determined by monitoring the compute node with the Linux htop command during execution.
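As a rough illustration of how a launcher might pass these OpenMP settings to a child process, the Python sketch below builds an environment dictionary containing OMP_NUM_THREADS and the KMP_AFFINITY value reported in this paper. The helper name openmp_env is hypothetical, and this is not the pMatlab implementation.

```python
import os


def openmp_env(nthread, base_env=None):
    """Return a copy of the environment with the OpenMP settings applied.

    Assumption: the child process (e.g. an Octave or Matlab instance)
    reads these variables at startup.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(nthread)
    env["KMP_AFFINITY"] = "granularity=fine"
    return env


# Example: settings for a process that should use 16 OpenMP threads.
env = openmp_env(16, base_env={})
print(env)
```

The resulting dictionary could be handed to subprocess.run(..., env=env) when starting each instance.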

III Process Pinning

The Xeon Phi processor employs a memory hierarchy whereby certain tiles, cores, and hyperthreads share different levels of memory. It can be advantageous to launch processes on the Xeon Phi with an awareness of this memory hierarchy so the underlying OpenMP threads can exploit preferential data locality. In particular, it is good to avoid having OpenMP threads execute on cores that are far away from the data they require to operate. The Linux operating system provides a number of tools for pinning processes to specific logical cores. This work relies on the taskset --cpu-list command to launch Matlab/Octave instances that are pinned to specific logical cores.

The Xeon Phi presents itself to the Linux operating system as 256 cpus (one cpu for each hyperthread). The cpus $p$, $p+64$, $p+128$, and $p+192$ will be on the same physical core. Likewise, if $p$ is even, then cpu $p+1$ will be on the same physical tile. The mapping of four Matlab/Octave instances to the logical core structure of a 32 tile, 64 core, 256 hyperthread Xeon Phi is illustrated in Figure 3. This binding maximizes data locality of the underlying OpenMP threads.
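The cpu numbering described above can be turned into a taskset --cpu-list argument. The Python sketch below is a hypothetical illustration: it assumes each process receives a contiguous block of cores together with all of their hyperthreads, which may differ in detail from the mapping shown in Figure 3. The function names and the program arguments (octave, bench.m) are placeholders.

```python
def sibling_cpus(p, ncores=64, threads_per_core=4):
    """Return the cpus p, p+64, p+128, p+192 grouped together in the text."""
    return [p + k * ncores for k in range(threads_per_core)]


def cpu_list_for_process(rank, nproc, ncores=64, threads_per_core=4):
    """Build a taskset cpu list for process `rank` out of `nproc`.

    Assumption: cores are split into contiguous, equal-sized blocks.
    """
    if nproc < 1 or ncores % nproc != 0:
        raise ValueError("nproc must evenly divide ncores")
    if not 0 <= rank < nproc:
        raise ValueError("rank must be in [0, nproc)")
    cores_per_proc = ncores // nproc
    first = rank * cores_per_proc
    last = first + cores_per_proc - 1
    ranges = []
    for k in range(threads_per_core):
        lo, hi = first + k * ncores, last + k * ncores
        ranges.append(f"{lo}-{hi}" if hi > lo else str(lo))
    return ",".join(ranges)


def taskset_command(rank, nproc, program):
    """Prefix a program's argument list with the pinning command."""
    return ["taskset", "--cpu-list", cpu_list_for_process(rank, nproc)] + list(program)


print(taskset_command(1, 4, ["octave", "bench.m"]))
```

Keeping the hyperthreads of a core inside one process is a design choice of this sketch, intended to keep OpenMP threads close to the data they share.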

IV Memory Modes

Our Xeon Phi processors have a two-level memory hierarchy consisting of 16 Gigabytes of faster near memory (MCDRAM) and 192 Gigabytes of slower far memory (DRAM) [27, 28]. The Xeon Phi has a variety of settings for managing its memory. These settings are generally set at compute node boot time.

The faster and smaller near memory has three modes: flat, cache, and hybrid. In flat mode both near and far memory form a single address space. In cache mode the near memory acts as another layer of cache for the far memory. In hybrid mode, half of the fast memory is flat and half is treated as cache.

The memory can also be divided into different NUMA (non-uniform memory access) domains:

all2all: cache line addresses are uniformly hashed across the entire memory

hemisphere: cache line addresses are separately hashed into two memory domains

quadrant: cache line addresses are separately hashed into four memory domains

snc-2: sub-NUMA clustering 2 is similar to hemisphere while also exposing each domain for NUMA aware software to exploit

snc-4: sub-NUMA clustering 4 is similar to quadrant while also exposing each domain for NUMA aware software to exploit

Combined, these memory modes form 15 distinct configurations:

• all2all-cache, all2all-flat, all2all-hybrid

• hemisphere-cache, hemisphere-flat, hemisphere-hybrid

• quadrant-cache, quadrant-flat, quadrant-hybrid

• snc-2-cache, snc-2-flat, snc-2-hybrid

• snc-4-cache, snc-4-flat, snc-4-hybrid
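The 15 configurations are the Cartesian product of the five NUMA domain settings and the three near memory modes. A minimal Python sketch of that enumeration follows; the constant and function names are illustrative.

```python
from itertools import product

# The five NUMA domain settings and three near memory modes from the text.
CLUSTER_MODES = ["all2all", "hemisphere", "quadrant", "snc-2", "snc-4"]
MEMORY_MODES = ["cache", "flat", "hybrid"]


def memory_configurations():
    """Return every cluster-memory combination as a 'cluster-memory' label."""
    return [f"{c}-{m}" for c, m in product(CLUSTER_MODES, MEMORY_MODES)]


configs = memory_configurations()
print(len(configs))
for name in configs:
    print(name)
```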

V Performance

For any particular application, different memory configurations could provide different performance benefits. The Xeon Phi is designed for vector operations of the type found in matrix-matrix multiply. Selecting a configuration that is optimal for this operation provides a good foundation for allowing the Xeon Phi to deliver what it was designed to do. To determine this configuration, 15 Xeon Phi nodes were set in each memory configuration and the Matlab and Octave matrix-matrix multiply performance was measured for various values of Nproc and Nthread.

The performance benchmark consisted of each Matlab/Octave instance creating two N $\times$ N matrices $\mathbf{A}$ and $\mathbf{B}$ of random double precision values and multiplying these to form another N $\times$ N matrix $\mathbf{C}$. The memory required by each instance for this operation is 3 $\times$ 8 $\times$ N $\times$ N bytes. For these experiments, the matrix size N was chosen to be 48000/$\sqrt{\mathsf{Nproc}}$ so that the total memory used across all instances was the same for all configurations (55 Gigabytes). The performance results for Matlab version 2018a are shown in Figure 4. The performance results for GNU Octave version 4.4 are shown in Figure 5. Both Matlab and GNU Octave show similar performance across all memory modes, and the performance of the two best modes (all2all-cache and quadrant-cache) is significantly better than that of the default mode (all2all-flat). Based on these data, the LLSC Xeon Phi system selected all2all-cache as its default memory mode.
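The sizing rule above can be checked with a short calculation. The Python sketch below is illustrative; it assumes N is rounded to the nearest integer when Nproc is not a perfect square, and it uses the conventional 2N^3 operation count for a dense matrix multiply, neither of which is stated in the text. The function names are hypothetical.

```python
import math


def matrix_size(nproc, base=48000):
    """Matrix dimension N = base / sqrt(nproc), rounded (rounding is an assumption)."""
    return int(round(base / math.sqrt(nproc)))


def total_bytes(nproc, base=48000):
    """Memory for three N x N double precision matrices in each of nproc instances."""
    n = matrix_size(nproc, base)
    return nproc * 3 * 8 * n * n


def gigaflops(n, seconds):
    """Rate for one N x N multiply, using the conventional 2*N^3 flop count."""
    return 2.0 * n ** 3 / seconds / 1e9


for nproc in (1, 4, 16, 64):
    print(nproc, matrix_size(nproc), total_bytes(nproc) / 1e9)
```

For the perfect-square process counts shown, the total comes to the same number of bytes, consistent with the roughly 55 Gigabytes quoted above.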

The Xeon Phi 7210 has 128 AVX512 units each capable of performing 16 multiply-accumulate operations per clock cycle. The AVX512 clock cycle in the Xeon Phi 7210 is 1.1 GHz which means that the practical peak performance is 128 $\times$ 16 (flop) $\times$ 1.1 GHz = 2252 Gigaflops. Figures 4 and 5 show that a performance of 1500 Gigaflops is consistently achievable, which is 66% of the practical peak performance of Xeon Phi.
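The peak estimate above is a simple product, reproduced here as a Python sketch. The function name is illustrative, and the 1500 Gigaflops figure is taken from the text rather than measured by this code. Evaluated exactly, the product is about 2252.8 Gigaflops, which the text reports as 2252.

```python
def practical_peak_gflops(vector_units=128, flops_per_cycle=16, clock_ghz=1.1):
    """Practical peak in Gigaflops: units x flops per cycle x clock in GHz."""
    return vector_units * flops_per_cycle * clock_ghz


peak = practical_peak_gflops()
achieved = 1500.0  # Gigaflops, the sustained rate quoted in the text
fraction = achieved / peak
print(f"peak ~ {peak:.1f} Gigaflops, achieved fraction ~ {fraction:.3f}")
```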

VI Summary

The Intel Xeon Phi manycore processor is designed to provide high performance matrix computations of the type often performed in data analysis environments such as Matlab, GNU Octave, Julia, Python, and R. Optimizing the performance of matrix operations within these data analysis environments requires tuning Xeon Phi OpenMP settings, process pinning, and memory modes. This paper measured matrix-matrix multiplication performance for Matlab and GNU Octave for different combinations of process counts and OpenMP threads covering all Xeon Phi memory modes. These measurements indicate that using KMP_AFFINITY=granularity=fine, taskset pinning, and all2all cache memory mode allows both Matlab and GNU Octave to achieve 66% of the practical peak performance of the Xeon Phi. Using these settings has provided improved performance across a range of applications and has enabled our Xeon Phi system to deliver impactful results on a number of real-world applications in health sciences [20], hurricane relief [21], astronomy [22], and cybersecurity [23].

Acknowledgement

The authors wish to acknowledge the following individuals for their contributions and support: Bob Bond, Alan Edelman, Charles Leiserson, Dave Martinez, Mimi McClure, Victor Roytburd, and Michael Wright.

Bibliography

[1] J. S. McMahon and K. Teitelbaum, “Space-time adaptive processing on the mesh synchronous processor,” in Proceedings of International Conference on Parallel Processing, pp. 734–740, April 1996.
[2] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, Jae-Wook Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, “The raw microprocessor: a computational fabric for software circuits and general-purpose programs,” IEEE Micro, vol. 22, pp. 25–35, March 2002.
[3] T. G. Mattson, R. Van der Wijngaart, and M. Frumkin, “Programming the intel 80-core network-on-a-chip terascale processor,” in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08, (Piscataway, NJ, USA), pp. 38:1–38:11, IEEE Press, 2008.
[4] C. Ramey, “Tile-gx 100 manycore processor: Acceleration interfaces and architecture,” in 2011 IEEE Hot Chips 23 Symposium (HCS), pp. 1–21, Aug 2011.
[5] A. Sodani, “Knights landing (knl): 2nd generation intel xeon phi processor,” in 2015 IEEE Hot Chips 27 Symposium (HCS), pp. 1–24, Aug 2015.
[6] A. Sodani, R. Gramunt, J. Corbal, H. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. Liu, “Knights landing: Second-generation intel xeon phi product,” IEEE Micro, vol. 36, pp. 34–46, Mar 2016.
[7] J. Kepner and H. Jananthan, Mathematics of Big Data. MIT Press, 2018.
[8] J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov, “Hpc programming on intel many-integrated-core hardware with magma port to xeon phi,” Sci. Program., vol. 2015, pp. 9:9–9:9, Jan. 2015.