Optimizing the Linear Fascicle Evaluation Algorithm for Multi-Core and   Many-Core Systems

Karan Aggarwal; Uday Bondhugula

arXiv:1905.06234·cs.DC·July 25, 2019

Optimizing the Linear Fascicle Evaluation Algorithm for Multi-Core and Many-Core Systems

Karan Aggarwal, Uday Bondhugula

PDF

TL;DR

This paper presents optimized algorithms for sparse matrix-vector multiplication in neuroscience applications, achieving significant speedups on multi-core CPUs and GPUs by combining platform-independent and platform-specific techniques.

Contribution

It introduces a comprehensive optimization framework for SpMV operations in tensor-decomposed datasets, improving performance on both CPUs and GPUs.

Findings

01

CPU speedup of 27.12x on 16-core systems

02

GPU speedup of 5.2x over reference implementations

03

Effective optimization techniques for irregular sparse data

Abstract

Sparse matrix-vector multiplication (SpMV) operations are commonly used in various scientific applications. The performance of the SpMV operation often depends on exploiting regularity patterns in the matrix. Various representations have been proposed to minimize the memory bandwidth bottleneck arising from the irregular memory access pattern involved. Among recent representation techniques, tensor decomposition is a popular one used for very large but sparse matrices. Post sparse-tensor decomposition, the new representation involves indirect accesses, making it challenging to optimize for multi-cores and GPUs. Computational neuroscience algorithms often involve sparse datasets while still performing long-running computations on them. The LiFE application is a popular neuroscience algorithm used for pruning brain connectivity graphs. The datasets employed herein involve the Sparse…

Tables10

Table 1. Table 1. System details

Microarchitecture	Intel Skylake
Processors	2-socket Intel Xeon Silver 4110
Clock	2.10 GHz
Cores	16 (8 per socket)
Hyperthreading	disabled
Private caches	64 KB L1 cache, 1024 KB L2 cache
Shared cache	11,264 KB L3 cache
Memory	256 GB DDR4 (2.4 GHz)
Microarchitecture (GPU)	NVIDIA Turing
GPU	NVIDIA GeForce RTX 2080 Ti
Multiprocessors (SMs)	64
CUDA cores (SPs)	4352
GPU Base Clock	1350 Mhz
L1 cache/shared memory	96 KB
L2 cache size	5.5 MB
Memory size	11.26 GB GDDR6
Memory bandwidth	616 GB/s
Matlab version	9.5.0.944444 (R2018b)
MRtrix version	3.0
CUDA/NVCC version	10.0
NVCC version	10.0.130
Compiler	GNU C/C++ (gcc/g++) 6.3.0
Compiler flags	-O3 -ptx
OS	Linux kernel 3.10.0 (64-bit) (CentOS 7)

Table 2. Table 2. Execution time for CPU-naive implementation of the SpMV operations for various data restructuring methods on Intel Xeon processor.

Iterations	SpMV operation
	DSC		WC
	Atom	Voxel	Atom	Voxel
1	8.187s	8.214s	6.320s	6.490s
100	8.490s	8.228s	6.316s	6.496s
200	8.484s	8.227s	6.315s	6.500s
300	8.465s	8.231s	6.283s	6.478s
400	8.459s	8.210s	6.286s	6.464s
500	8.452s	8.767s	6.301s	6.463s

Table 3. Table 3. Execution time of CPU-naive-par-withoutBLAS implementation of SpMV for different computation partitioning + data restructuring combinations on Intel Xeon processor.

	SpMV operation
Iterations	DSC		WC
	Voxel+Voxel	Coeff+Voxel	Voxel+Voxel	Atom+Atom	Coeff+Voxel	Coeff+Atom
1	2.553s	2.040s	0.905s	0.957s	0.720s	0.678s
100	2.663s	1.785s	0.814s	0.904s	0.636s	0.682s
200	2.451s	1.778s	0.877s	0.955s	0.640s	0.666s
300	2.471s	1.787s	0.834s	1.001s	0.645s	0.673s
400	2.412s	1.778s	0.841s	0.905s	0.646s	0.665s
500	2.407s	1.783s	0.811s	0.907s	0.660s	0.667s

Table 4. Table 4. Execution time of CPU-opt implementation of SpMV for different computation partitioning + data restructuring combinations on Intel Xeon processor.

	SpMV operation
Iterations	DSC		WC
	Voxel+Voxel	Coeff+Voxel	Voxel+Voxel	Atom+Atom	Coeff+Voxel	Coeff+Atom
1	0.534s	0.486s	0.510s	0.545s	0.482s	0.382s
100	0.147s	0.133s	0.496s	0.533s	0.442s	0.379s
200	0.124s	0.113s	0.496s	0.524s	0.432s	0.376s
300	0.126s	0.111s	0.503s	0.534s	0.428s	0.375s
400	0.139s	0.112s	0.500s	0.538s	0.446s	0.375s
500	0.122s	0.111s	0.498s	0.559s	0.426s	0.391s

Table 5. Table 5. Total execution time (in min) up till different iterations of the SBBNNLS algorithm for a different number of cores on Intel Xeon processor. The baseline is CPU-naive version.

Code	Iters	Execution time (in min)						Speedup
			2	4	8	12	16	1	2	4	8	12	16
CPU-naive	10	5.24	5.73	3.14	1.85	1.37	1.11	1.00	0.91	1.66	2.82	3.81	4.73
	100	45.1	54.5	30.1	17.9	13.1	10.5	1.00	0.82	1.49	2.51	3.42	4.29
	500	225	271	149	88.6	65.6	52.2	1.00	0.82	1.51	2.53	3.42	4.30
CPU-opt	10	1.91	1.22	0.66	0.41	0.35	0.31	2.74	4.29	7.86	12.5	14.9	17.2
	100	14.3	8.86	4.76	2.78	2.22	1.97	3.13	5.08	9.46	16.1	20.3	22.8
	500	61.8	39.1	20.8	12.1	9.43	8.29	3.63	5.74	10.8	18.5	23.8	27.2

Table 6. Table 6. Execution time of Ref-opt implementation of the SpMV operations for various data restructuring techniques on NVIDIA GPU.

Iterations	SpMV operation
	DSC		WC
	Atom	Voxel	Atom	Voxel
1	1.025s	2.087s	0.310s	0.311s
100	0.185s	0.219s	0.316s	0.320s
200	0.166s	0.190s	0.319s	0.320s
300	0.162s	0.187s	0.319s	0.320s
400	0.162s	0.186s	0.319s	0.320s
500	0.162s	0.186s	0.319s	0.320s

Table 7. Table 7. Execution time of Ref-opt implementation of the SpMV operations for different computation partitioning + data restructuring combinations on NVIDIA GPU.

	SpMV operation
Iter(s)	DSC		WC
	Voxel+Voxel	Coeff+Voxel	Voxel+Voxel	Atom+Atom	Coeff+Voxel	Coeff+Atom
1	0.318s	1.025s	0.311s	0.313s	0.188s	0.122s
100	0.057s	0.185s	0.318s	0.320s	0.184s	0.121s
200	0.053s	0.166s	0.321s	0.320s	0.184s	0.121s
300	0.052s	0.162s	0.321s	0.320s	0.184s	0.121s
400	0.052s	0.162s	0.321s	0.320s	0.182s	0.121s
500	0.052s	0.162s	0.322s	0.320s	0.184s	0.120s

Table 8. Table 8. Execution time of the GPU-opt implementation of the SpMV operations for different computation partitioning + data restructuring combinations on NVIDIA GPU.

	SpMV operation
Iter(s)	DSC		WC
	Voxel+Voxel	Coeff+Voxel	Voxel+Voxel	Atom+Atom	Coeff+Voxel	Coeff+Atom
1	0.041s	2.431s	0.074s	0.069s	0.049s	0.057s
100	0.017s	0.141s	0.064s	0.065s	0.047s	0.044s
200	0.015s	0.094s	0.064s	0.064s	0.047s	0.044s
300	0.015s	0.089s	0.064s	0.065s	0.047s	0.044s
400	0.015s	0.089s	0.064s	0.065s	0.047s	0.044s
500	0.015s	0.089s	0.065s	0.065s	0.047s	0.044s

Table 9. Table 9. Execution time (in minutes) of the SBBNNLS algorithm for various tractography algorithms using STN96 dMRI data (with 𝐍 θ = 𝟗𝟔 subscript 𝐍 𝜃 96 \bf N_{\theta}=96 ).

Fascicles	Tractography	Voxels	$Φ$ size	CPU-naive	CPU-opt	Ref-opt	GPU-opt
50000	DET	151414	510.0 MB	16.8m	1.17m	0.555m	0.146m
	PROB	162499	522.9 MB	20.7m	1.71m	0.972m	0.157m
	iFOD1	212874	726.7 MB	49.7m	2.93m	1.595m	0.331m
	SD_STREAM	195066	497.2 MB	12.9m	1.13m	0.535m	0.118m
	FACT	138860	372.8 MB	7.10m	0.68m	0.319m	0.084m
100000	DET	161443	688.1 MB	30.3m	1.76m	1.102m	0.232m
	PROB	173685	692.8 MB	40.9m	2.16m	1.428m	0.244m
	iFOD1	231586	1.020 GB	1h47m	5.03m	2.722m	0.557m
	SD_STREAM	217742	617.9 MB	24.3m	1.61m	0.764m	0.170m
	FACT	161120	457.2 MB	13.2m	1.00m	0.506m	0.117m
150000	DET	165843	858.8 MB	45.8m	2.32m	1.391m	0.310m
	PROB	178984	851.6 MB	50.1m	2.81m	1.830m	0.318m
	iFOD1	239522	1.321 GB	2h27m	7.53m	3.631m	0.747m
	SD_STREAM	227416	721.1 MB	35.8m	2.12m	0.930m	0.216m
	FACT	171782	520.8 MB	19.4m	1.33m	0.641m	0.130m
200000	DET	168608	1.001 GB	59.0m	2.71m	1.644m	0.387m
	PROB	182302	1006 MB	1h19m	4.21m	2.232m	0.396m
	iFOD1	244265	1.611 GB	3h20m	9.27m	4.345m	0.950m
	SD_STREAM	233403	818.5 MB	47.1m	2.49m	1.124m	0.262m
	FACT	178779	579.0 MB	25.4m	1.51m	0.720m	0.156m
250000	DET	170403	1.171 GB	1h14m	3.37m	1.852m	0.459m
	PROB	184613	1.132 GB	1h56m	4.82m	2.616m	0.471m
	iFOD1	247356	1.905 GB	4h09m	10.9m	5.798m	1.202m
	SD_STREAM	237399	915.4 MB	58.8m	2.94m	1.288m	0.304m
	FACT	183885	633.8 MB	31.7m	1.83m	0.812m	0.190m
500000	DET	175351	1.970 GB	2h42m	5.76m	3.039m	0.829m
	PROB	190589	1.871 GB	3h52m	8.71m	4.485m	0.859m
	iFOD1	255309	3.362 GB	6h05m	21.1m	9.009m	2.155m
	SD_STREAM	247291	888.7 MB	1h56m	4.85m	2.070m	0.528m
	FACT	197299	1.024 GB	1h02m	3.08m	1.249m	0.301m

Table 10. Table 10. Execution time (in minutes) up till different iterations of the SBBNNLS for various code implementations running on CPU and GPU.

Iterarions	Execution time (minutes)					Speedup over
Iterarions	CPU-naive	CPU-opt	Ref-opt	ReAl-LiFE	GPU-opt	CPU-naive	CPU-opt	Ref-opt	ReAl-LiFE	GPU-opt
10	5.241	0.304	0.421	0.035	0.025	1.0	17.24	12.48	150.60	209.64
100	45.07	1.978	1.344	0.318	0.186	1.0	22.79	33.54	141.41	242.35
500	224.8	8.294	4.393	1.603	0.855	1.0	27.12	51.21	140.23	263.06

Equations32

\scalebox{1.0}{$\bf y\approx Mw$},

\scalebox{1.0}{$\bf y\approx Mw$},

\scalebox{1.0}{$\underset{\bf w}{\bf min}(\frac{1}{2}\left\|\left(\bf y-Mw\rm\rm\right)\right\|^{2})$ such that $w_{f}\geqslant 0,\forall f$}.

\scalebox{1.0}{$\underset{\bf w}{\bf min}(\frac{1}{2}\left\|\left(\bf y-Mw\rm\rm\right)\right\|^{2})$ such that $w_{f}\geqslant 0,\forall f$}.

\scalebox{1.0}{$\bf M_{v}=S_{0}(v){D\Phi_{v}}$},

\scalebox{1.0}{$\bf M_{v}=S_{0}(v){D\Phi_{v}}$},

\scalebox{1.0}{$\bf Y=\Phi\times_{1}D\times_{2}S_{0}\times_{3}w^{T}$},

\scalebox{1.0}{$\bf Y=\Phi\times_{1}D\times_{2}S_{0}\times_{3}w^{T}$},

\scalebox{1.0}{$w^{(i+1)}=[w^{(i)}-\alpha^{(i)}\nabla g(w^{(i)})]_{+}$},

\scalebox{1.0}{$w^{(i+1)}=[w^{(i)}-\alpha^{(i)}\nabla g(w^{(i)})]_{+}$},

\scalebox{1.0}{ $\nabla g(w)=M^{T}(Mw-y)$},

\scalebox{1.0}{ $\nabla g(w)=M^{T}(Mw-y)$},

\scalebox{1.0}{ $\alpha^{(i)}=\dfrac{\langle\nabla\tilde{g}(w^{(i-1)}),\nabla\tilde{g}(w^{(i-1)})\rangle}{\langle M\nabla\tilde{g}(w^{(i-1)}),M\nabla\tilde{g}(w^{(i-1)})\rangle}$},

\scalebox{1.0}{ $\alpha^{(i)}=\dfrac{\langle\nabla\tilde{g}(w^{(i-1)}),\nabla\tilde{g}(w^{(i-1)})\rangle}{\langle M\nabla\tilde{g}(w^{(i-1)}),M\nabla\tilde{g}(w^{(i-1)})\rangle}$},

\scalebox{1.0}{ $\alpha^{(i)}=\dfrac{\langle M\nabla\tilde{g}(w^{(i-1)}),M\nabla\tilde{g}(w^{(i-1)})\rangle}{\langle M^{T}M\nabla\tilde{g}(w^{(i-1)}),M^{T}M\nabla\tilde{g}(w^{(i-1)})\rangle}$}.

\scalebox{1.0}{ $\alpha^{(i)}=\dfrac{\langle M\nabla\tilde{g}(w^{(i-1)}),M\nabla\tilde{g}(w^{(i-1)})\rangle}{\langle M^{T}M\nabla\tilde{g}(w^{(i-1)}),M^{T}M\nabla\tilde{g}(w^{(i-1)})\rangle}$}.

w^{(i + 1)} = [w^{(i)} - α^{(i)} w^{^{'}}]_{+}

w^{(i + 1)} = [w^{(i)} - α^{(i)} w^{^{'}}]_{+}

y = (M w^{(i)} - b)

y = (M w^{(i)} - b)

w^{^{'}} = M^{T} y

w^{^{'}} = M^{T} y

v^{^{'}} = M w^{^{'}}

v^{^{'}} = M w^{^{'}}

α^{(i)} = \frac{⟨ w ^{^{'}} , w ^{^{'}} ⟩}{⟨ v ^{^{'}} , v ^{^{'}} ⟩}

α^{(i)} = \frac{⟨ w ^{^{'}} , w ^{^{'}} ⟩}{⟨ v ^{^{'}} , v ^{^{'}} ⟩}

v^{^{'}} = M w^{^{'}}

v^{^{'}} = M w^{^{'}}

v^{^{''}} = M^{T} v^{^{'}}

v^{^{''}} = M^{T} v^{^{'}}

α^{(i)} = \frac{⟨ v ^{^{'}} , v ^{^{'}} ⟩}{⟨ v ^{^{''}} , v ^{^{''}} ⟩}

α^{(i)} = \frac{⟨ v ^{^{'}} , v ^{^{'}} ⟩}{⟨ v ^{^{''}} , v ^{^{''}} ⟩}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

MethodsPruning · TuckER

Full text

Optimizing the Linear Fascicle Evaluation Algorithm for Multi-Core and Many-Core Systems

Karan Aggarwal

Dept of CSAIndian Institute of ScienceBengaluru560012India

[email protected]

and

Uday Bondhugula

Dept of CSAIndian Institute of ScienceBengaluru560012India

[email protected]

Abstract.

Sparse matrix-vector multiplication (SpMV) operations are commonly used in various scientific and engineering applications. The performance of the SpMV operation often depends on exploiting regularity patterns in the matrix. Various representations and optimization techniques have been proposed to minimize the memory bandwidth bottleneck arising from the irregular memory access pattern involved. Among recent representation techniques, tensor decomposition is a popular one used for very large but sparse matrices. Post sparse-tensor decomposition, the new representation involves indirect accesses, making it challenging to optimize for multi-cores and even more demanding for the massively parallel architectures, such as on GPUs.

Computational neuroscience algorithms often involve sparse datasets while still performing long-running computations on them. The Linear Fascicle Evaluation (LiFE) application is a popular neuroscience algorithm used for pruning brain connectivity graphs. The datasets employed herein involve the Sparse Tucker Decomposition (STD) — a widely used tensor decomposition method. Using this decomposition leads to multiple indirect array references, making it very difficult to optimize on both multi-core and many-core systems. Recent implementations of the LiFE algorithm show that its SpMV operations are the key bottleneck for performance and scaling. In this work, we first propose target-independent optimizations to optimize these SpMV operations, followed by target-dependent optimizations for CPU and GPU systems. The target-independent techniques include: (1) standard compiler optimizations to prevent unnecessary and redundant computations, (2) data restructuring techniques to minimize the effects of indirect accesses, and (3) methods to partition computations among threads to obtain coarse-grained parallelism with low synchronization overhead. Then we present the target-dependent optimizations for CPUs such as: (1) efficient synchronization-free thread mapping, and (2) utilizing BLAS calls to exploit hardware-specific speed. Following that, we present various GPU-specific optimizations to optimally map threads at the granularity of warps, thread blocks and grid. Furthermore, to automate the CPU-based optimizations developed for this algorithm, we also extend the PolyMage domain-specific language, embedded in Python. Our highly optimized and parallelized CPU implementation obtain a speedup of $6.3\times$ over the naive parallel CPU implementation running on 16-core Intel Xeon Silver (Skylake-based) system. In addition to that our optimized GPU implementation achieves a speedup of $5.2\times$ over a reference optimized GPU code version on NVIDIA’s GeForce RTX 2080 Ti GPU, and a speedup of $9.7\times$ over our highly optimized and parallelized CPU implementation.

SpMV, Indirect array accesses, Connectome, Tractography, Fascicle, dMRI, LiFE Algorithm, Tensor decomposition, Sparse Tucker Decomposition, Non-negative least square, SBBNNLS, Multi-core, GPU, PolyMage

††ccs: Mathematics of computing Computations on matrices††ccs: Computing methodologies Shared memory algorithms††ccs: Applied computing Biological networks††ccs: Applied computing Imaging

1. Introduction

Sparse matrix-vector multiplication (SpMV) is a key operation in many scientific and engineering applications. As SpMV is typically memory bandwidth and latency bound, it plays a significant role in determining the overall execution time as well as the scalability of an application. Utilizing the architecture-specific memory model to reduce its memory bandwidth requirement is a major challenge, especially for highly parallel architectures such as GPUs, where exploiting the regularity in unstructured accesses is key. Numerous prior works have been proposed to improve the performance of SpMV, including that of the development of new sparse representations 59; 13; 88, representation-specific optimizations 13; 12; 36 and architecture-specific techniques 9; 13; 54; 62; 99; 101; 103; 81.

Tensor decomposition 47 is a popular technique to represent the LHS matrix in SpMV as a combination of a tensor and other auxiliary data structures in a way that drastically reduces the amount of storage. Tensor decomposition has found use to perform SpMV operations efficiently across many domains such as digital signal processing 25; 82; 49; 50; 51, machine learning 82, data mining 72; 86; 87; 5; 6, computational biology 53; 19; 69; 67; 68; 3; 4; 97; 61; 65; 11 and several more mentioned by Kolda and Bader 47. Tucker et al. 91 presented a widely used tensor decomposition technique based on high-order singular value decomposition. Tucker’s technique is used in a range of applications 106; 105; 74; 47. More importantly, the Tucker model is used to perform low-rank decomposition of tensors to depict the sparse representations of matrices, and this is commonly referred to as the Sparse Tucker Decomposition (STD) 91. The major challenge for an STD-based application however is that the sparse representation entails multiple indirect array accesses. Therefore, efficiently utilizing multi-core and many-core architectures poses a significant difficulty because such accesses are both memory latency and bandwidth unfriendly. However, employing STD for an SpMV operation is a necessary trade-off considering the reduction in memory utilization obtained for a sparse matrix.

Building brain connectivity graphs or the wiring diagram of neural circuitry of the brain, termed as connectome, is an exciting computational neuroscience conundrum involving large but sparse matrices. Understanding the neural pathways is key to studying the connection between brain-regions and behavior. Principally, a connectome can be described at various scales based on the spatial resolution 83; 100; 63. The scales can be primarily categorized as microscale, mesoscale and macroscale 44. A microscale connectome is a neuron-to-neuron brain graph involving $10^{11}$ nodes (neurons) and $10^{17}$ edges (neuronal connection); currently, obtaining and processing such large data appears infeasible. A mesoscale connectome building technique is based on anatomical properties of the brain, which again is not a viable choice due to poor resolution of electron-microscopy 41; 16. Once technology is enhanced, optimizing such large sparse datasets will still be a formidable problem. In contrast, a macroscale level connectome 28 divides a brain model into 3D volumes called voxels (in the order of $10^{6}$ in number); this is thus a much more tractable approach.

Diffusion-weighted Magnetic Resonance Imaging (dMRI) is a popular macroscale choice, that captures the diffusion of water molecules in the brain. The dMRI along with tractography techniques can be used to estimate white matter connectivity in the human brain. These pathways represent physical connections between brain regions and when analyzed in conjunction with behaviour, can provide interesting insights into brain-behaviour relationships. These insights are often essential in diagnosing diseases of the brain such as Alzheimer’s Disease 70, a neurodegenerative disorder involving degradation of white matter. While the non-invasive nature of dMRI enables studying structural connectivity in-vivo in humans, it suffers from a major limitation in that the validity of the results cannot be tested easily due to the lack of access to ground truth 39; 60. Data acquisition protocols and tractography approaches often depend on the specific scientific questions being addressed and can differ significantly across cohorts. Thus, a standardized evaluation technique to assess connectomes and establish evidence for white matter pathways is critical for accurate and reliable estimation of structural connectivity in the brain.

One such technique that addresses these shortcomings is the Linear Fascicle Evaluation (LiFE) 77; 18; 19, an algorithm that prunes white matter connectomes to produce an optimized subset of fibers that best explain the underlying diffusion signal. LiFE posits that the diffusion signal in a voxel (a volume of brain tissue) can be approximated by a weighted sum of the individual contribution of every streamline traversing that voxel. The model thus entails a simple constrained optimization problem where the weights associated with every streamline are estimated by minimizing the error between the measured and predicted diffusion signal. This optimization is carried out using a variant of the gradient descent method - the Subspace Barzilai-Borwein non-negative least squares (SBBNNLS) algorithm 45, and involves iterative matrix multiplications. However, large execution times and memory requirements have precluded the large-scale use of the LiFE algorithm. While the memory issues have recently been addressed with the use of sparse representations (Sparse Tucker Decomposition 91) of the data, the matrix-vector multiplications, transformed to a more complex sequence of operations as presented by Pestilli and Caiafa 18 are still computationally demanding, involving multiple indirect array accesses. Optimizing the transformed SpMV operations on both multi-cores and GPUs is a challenging task that is memory latency and bandwidth bound even for low-resolution dMRI datasets.

In literature, several prior works have been proposed to tackle irregular applications for both multi-core and GPU systems such as 7; 55; 84; 95; 93; 94. These approaches use inspector/executor paradigm 7 to exploit regularity in unstructured accesses. One such approach is presented by Venkat et al. 95 to automate the code generation for a particular class of application performing SpMV on GPUs. Other studies show various compiler transformations to reduce the runtime overhead of code generation by the inspector step in 93, and generate optimized code for wavefront parallelization for sparse-matrix representation in 94. These works have presented a semi-automatic approach to analyze the data (using the inspector step) and then generate the optimized code (using the executor step). Note that these works are limited to read non-affine accesses. However, our work targets optimization of the SpMV operations of LiFE, where the sparse matrix is decomposed using the STD technique. The new representation of the matrix involves multiple irregular accesses which includes both read as well as write array access. Therefore, due to presence of such type of accesses, the exiting works will have a high runtime overhead. However, in this work, we present a specific data restructuring method tuned for LiFE with low run-time overhead. Furthermore, the prior works amortizes the overhead due to inspector/executor across the iterations of a loop in a program. In contrast, our work amortizes the overhead due to restructuring across the several runs of the same program along with the iterations of a loop. Additionally, our data restructuring optimization could potentially be generalized and extended to other applications employing STD, although one would have to look for similar or other data patterns. Thus, our work proposes a tailored data restructuring method to tackles indirect access of SpMV operations used in LiFE.

Prior works on optimizing the LiFE application considered distributed systems and GPUs. Gugnani et al. 35 proposed a distributed memory based approach using MPI and OpenMP paradigms to parallelize the SpMV operations of LiFE and obtained a speedup of $8.7\times$ over the original approach. On the other hand, Madhav 57 developed a fast GPU implementation to optimize the SpMV operations of LiFE by incorporating simple optimization techniques. In another work, Kumar et al. 48 proposed a GPU-accelerated implementation for ReAl-LiFE 48, a modification of LiFE application that introduced regularized pruning constraint to build connectomes.

In this work, we optimize the SpMV operations by performing a number of target-independent and target-dependent optimizations. The target optimizations comprises: (1) standard compiler optimizations, (2) various data restructuring methods, and (3) techniques to partition computations among threads. These optimizations can be automated and extended to other applications performing SpMV operations where the matrix is decomposed using STD. The target-dependent optimizations that we propose for multi-core architectures are following: (1) efficient synchronization-free thread mapping, and (2) utilizing BLAS calls, and for the GPUs the optimizations includes optimal techniques to map threads at the granularity of warps, thread blocks and grids. Tailoring these optimizations for the LiFE application, we obtain a speedup of $27.12\times$ for our highly optimized and parallelized CPU code over the original sequential implementation, and speedups of $5.2\times$ and $1.87\times$ for our optimized GPU implementation over a reference optimized GPU implementation (developed by Madhav 57) and over the ReAl-LiFE GPU implementation (tweaked to perform same computations as the LiFE application) respectively. In addition, our work can express the SpMV operation of LiFE in a high-level language and abstract out other information using a domain-specific language (DSL) approach. Using the domain information, we can perform optimizations that provide significant improvements in performance and productivity. As a proof-of-concept, we extend PolyMage 71, a DSL designed for image processing pipelines, to express the key matrix operations in LiFE and automatically generate optimized CPU code to obtain similar performance improvements compared to that of our hand-optimized CPU implementation.

The key contributions of this paper are as follows:

•

We address challenges involved in optimizing SpMV operations of the LiFE application on multi-cores and GPUs by proposing various architecture-agnostic and architecture-dependent optimizations.

•

The target independent optimizations includes: (1) standard compiler optimizations to avoid unnecessary and redundant computations, (2) data restructuring methods to deal with multiple indirect array references that in turn make further optimizations valid and fruitful, and (3) effective partitioning of computations among threads to exploit coarse-grained parallelism while avoiding the usage of an atomic operation.

•

The CPU-specific optimizations comprises: (1) efficient synchronization-free thread mapping method to reduce load imbalance, and (2) mapping to BLAS calls to exploit fine-grained parallelism.

•

The GPU-specific optimizations include: (1) leveraging fine-grained parallelism by utilizing a GPU’s resources such as shared memory and the shuffle instruction, and (2) effectively transforming loops to map iterations in a better way.

•

Then we present new constructs added to the PolyMage DSL to represent a sparse matrix and automatically generate optimized CPU code for the SpMV operations of the LiFE application.

•

We present experimental results and analysis to show the usefulness of the optimizations we incorporated for SpMV of LiFE, and also compare them with the existing implementations.

•

We present experimental results and analysis by varying various LiFE application parameters such as the number of voxels, number of fibers and different tractography techniques used to process the dMRI data for generating a connectome in the LiFE.

The rest of this paper is organized as follows. We provide background on the LiFE application in Section 2. We describe the problem and challenges pertaining to optimizing SpMV computations of LiFE in Section 3. The target-dependent and the target-independent optimizations are described in Section 4. Then we present the constructs developed in the PolyMage DSL to generate an optimized parallelized CPU code for the SpMV operations in Section 5. Section 6 presents details and analysis of experiments we performed by varying various parameters of LiFE, the benefits of each optimization in an incremental manner, and a comparison of various implementations of the SpMV. Related work is discussed in Section 7, followed by conclusions and future works in Section 8.

2. Background

In this section, we introduce the LiFE model, the optimization algorithm, the essential computations involved in this algorithm as well as highlight the bottlenecks which have been addressed in subsequent sections.

2.1. LiFE Algorithm

Given a whole brain connectome obtained from diffusion data, the goal of the LiFE is to retain only those fibers that best predict the underlying diffusion signal. Let the total number of voxels in which the signal is measured be $\bf N_{v}$ . In each voxel, the signal is obtained along multiple non-collinear gradient directions ( $\bf N_{\theta}$ ), and is represented by a vector $\bf y~{}\in~{}\mathbb{R}^{N_{\theta}N_{v}}$ . Further, the contribution of each fiber f traversing voxel v is encoded in an array $\bf M~{}\in~{}\mathbb{R}^{N_{\theta}N_{v}\times N_{f}}$ , where $\bf N_{f}$ is the total number of fibers in the connectome. In each voxel, v, LiFE models the diffusion signal measured along each gradient direction $\theta$ as the weighted sum of the contributions of every fiber traversing v. In other words, a candidate connectome is pruned to obtain optimized connectome that best estimate the underlying diffusion signal. Thus, the signal across all voxels and all gradient directions can be summarized as:

[TABLE]

where $\bf y\in\mathbb{R}^{N_{\theta}N_{v}}$ is a vector containing demeaned diffusion signal for all voxels ( $\bf v$ ) across all the gradient directions ( $\bf\theta$ ). Matrix $\bf M\in\mathbb{R}^{N_{\theta}N_{v}\times N_{f}}$ , contains diffusion signal contribution by each fascicle ( $\bf f$ ) at a voxel ( $\bf v$ ) in all diffusion directions ( $\bf\theta$ ), and the $\bf w\in\mathbb{R}^{N_{f}}$ vector contains the weight coefficients for each streamline fascicle (Figure 1). Equation 1 is used to estimate the weights by minimizing the error, is solved using following non-negative least-squared optimization problem:

[TABLE]

The major challenge in solving Equation 2 is the significantly high memory requirements of the matrix $\bf M$ . Even for small datasets, $\bf M$ can consume about $40$ GB. In another work, the authors of LiFE proposed the ENCODE framework 76, wherein Sparse Tucker Decomposition (STD) 91, a sparse multiway decomposition method to encode brain connectome, was used to reduce the memory consumption by approximately $40\times$ . Using the STD technique, the diffusion signal contribution for a voxel ( $\bf v$ ), $\bf M_{v}\in\mathbb{R}^{N_{\theta}\times N_{f}}$ is represented as:

[TABLE]

where $S_{0}(v)$ is the diffusion signal measured in absence of gradient, $\bf D\in\mathbb{R}^{N_{\theta}\times{N_{a}}}$ is a dictionary matrix for canonical diffusion atoms estimating individual streamline fiber based on their orientation and signal contribution, and $\bf\Phi_{v}\in\mathbb{R}^{N_{a}\times N_{f}}$ is a sparse binary matrix, whose column indicate primary contributing atoms in individual fibers, in that voxel. Thus, an equation for all $v$ can be re-written as:

[TABLE]

where $\Phi\times_{1}D\times_{2}S_{0}$ is 3D representation of matrix $\bf M$ and $\bf\Phi$ is a 3D representation $\forall\bf~{}\Phi_{v}$ , with the goal to minimize the error between $\bf Y$ and $\bf y$ of Equation 1.

The optimization problem of Equation 4 is solved using sub-space Barzilie-Borwein non-negative least squares (SBBNNLS) algorithm 45. Typically, the SBBNNLS algorithm takes more than $500$ iterations to converge, accounting for more than $92\%$ ( $3$ - $12$ h) of the total execution time of LiFE (for the original naive sequential C language code). Given $\bf w_{0}$ as the initial weight vector, for every iteration, the weight vector is updated based on following equation:

[TABLE]

where gradient,

[TABLE]

and the $\alpha^{(i)}$ step value for every even iteration is computed using,

[TABLE]

and for the odd iterations using,

[TABLE]

The Equations 5-8 represent typical computations necessary for SBBNNLS of LiFE, also shown in Algorithm 1. Note that the tilde sign over gradient $\tilde{g}$ and ” $+$ ” subscript in Equation 5 indicates projection to positive space, i.e., negative values are replaced by zeros.

2.2. Matrix Computations using Sparse Tensor Decomposition

The SBBNNLS algorithm involves two compute-intensive SpMV operations involving the matrix $\bf M$ , i.e., $\bf Mw$ and $\bf M^{T}y$ . On an average, every iteration (even or odd iteration) of SBBNNLS requires the $\bf Mw$ operation twice and $\bf M^{T}y$ $1.5$ times. In Figure 2, it is shown how these simple SpMV operations are transformed to a complex sequence of operations once the matrix $\bf M$ is decomposed to a sparse format using STD. The sparse tensor ( $\bf\Phi$ ) stores non-zero indices, (atomsPtr, voxelsPtr and fibersPtr), along with the values vector (valuesPtr). In Figure 3, one can observe that the three indirection vectors of the $\bf\Phi$ tensor — atomsPtr, voxelsPtr and fibersPtr, redirects to the dictionary matrix DPtr, demeaned diffusion signal vector YPtr and weight vector wPtr respectively. The detailed algorithm for $\bf Mw$ and $\bf M^{T}y$ matrix operations are described in 18. The number of iterations of the outermost loop depends on the number of coefficients ( $\bf N_{c}$ ) representing the non-zero indices in the $\bf\Phi$ tensor or the size of the atomsPtr/voxelsPtr/fibersPtr vectors. The number of iterations of the innermost loop depends on the diffusion directions ( $\bf N_{\theta}$ ). Note that the innermost loop of $\bf Mw$ and $\bf M^{T}y$ corresponds to daxpy and dot-product operations respectively. It is also important noting that the wPtr vector is projected to the positive space; hence, the wPtr vector becomes sparser as it is updated after the execution of each iteration of SBBNNLS (negative values are replaced by zeros due to non-negativity property of SBBNNLS).

3. Problem and Challenges

In this section, we discuss problems and challenges associated with optimizing the SpMV operations used in the SBBNNLS algorithm.

3.1. Large dataset

In Equations 5-8, we observe that there are two major SpMV operations involved, namely, $\bf y=Mw$ and $\bf w=M^{T}y$ . The size of the matrix $\bf M$ depends on parameters such as the number of voxels ( $\bf N_{v}$ ), the number of fascicles ( $\bf N_{f}$ ) and the number of diffusion directions ( $\bf N_{\theta}$ ). The number of diffusion direction varies from $10$ - $300$ , voxels range from $10^{5}$ to $10^{6}$ and fibers from $10^{5}$ to $10^{7}$ ; therefore, the memory consumption may range from a few GBs to PBs. Thus, the matrix will typically not fit in commonly used memory systems. The authors of the LiFE application analyzed the connectome matrices and found that they are highly sparse in nature 77; 76. Hence, they proposed a low-rank Sparse Tucker Decomposition (STD) 91 based approach to represent the matrix $\bf M$ in a sparse tensor format and decompose it using domain-specific information. After decomposition, a new challenge of multiple irregular accesses is introduced, and this is discussed later in this section.

3.2. Architecture-specific Challenges

We will discuss some architecture-specific challenges posed in optimizing the SpMV operations of SBBNNLS.

Multi-core architecture:

In multi-core architectures, the processor can execute multiple independent instructions in parallel, hence improving the speed of a program. Shared memory multi-core architectures uses a multi-level cache memory to hide latency and reduce memory bandwidth utilization.

Improving data reuse: Shared memory multi-core architectures uses multi-level cache memory to minimize the delay caused due to memory latency. Hence, the data accessed multiple times should be reused optimally before eviction from the cache memory.

Exploiting coarse-grained parallelism: Coarse-grained parallelism is splitting of large chunk of a program so that the communication is minimized across the core. However, the coarse-grained parallelism requires load balancing so that no core remains idle.

Exploiting fine-grained parallelism: Fine-grained parallelism is spitting small chunks of programs to facilitate load balancing. However, faces a shortcoming of overhead caused due to usage of synchronization barrier.

GPU architecture:

Modern GPUs are massively parallel, multi-threaded, multi-core architectures with a memory hierarchy significantly different from CPUs. Exploiting this parallelism and the various levels of the memory hierarchy on a GPU is key to effectively optimizing the SpMV operations of SBBNNLS.

Exploiting massive parallelism: An appropriate partitioning and mapping of threads to a thread block or a grid is essential to exploit the massive parallelism on GPUs. One of the challenges here is to reduce the overhead of communication across the thread blocks and warps/threads of a thread block.

Efficiently using the GPU memory model: The SMs of a GPU share global memory, whereas local memory is allocated for a single thread. Shared memory is used for sharing data among threads of a thread block. A GPU provides multiple levels in its memory hierarchy to minimize the usage of memory bandwidth.

Coalesced memory accesses: Global memory accesses are grouped such that consecutive threads access successive memory location. When the threads of a warp access memory contiguously, the access is considered fully coalesced otherwise considered partially coalesced access. Coalesced memory accesses helps to reduce memory bandwidth requirement by loading local memory in as few memory transactions.

3.3. Indirect Array Accesses

As discussed in Section 2.2, after STD-based tensor decomposition, the SpMV operations of LiFE have several indirect array accesses. The challenges that arises for CPUs due to unstructured accesses are following: (a) the data reuse is low, hence memory bandwidth is poorly utilized, and (b) the code is executed sequentially to avoid data races that occur due to the dependent accesses. For GPUs, these irregular references (a) hinder the utilization of massive parallelism of GPUs since synchronization and an atomic operation is required to avoid data races, and (b) hamper the usage of various fast GPU memory spaces and coalesced memory accesses. These are thus the main challenges in optimizing the SpMV operations of the LiFE algorithm on general-purpose multi-core and GPU systems.

4. Optimizations

In this section, we discuss details of the techniques we incorporate to optimize the SpMV operations used in the LiFE algorithm. Firstly, we discuss target-independent optimization techniques, followed by target-specific optimizations for parallel architectures such as multi-core and GPU systems. We denote the SpMV operations for computing the diffusion signal ( $\bf y=Mw$ ) with DSC and the weight ( $\bf w=M^{T}y$ ) with WC. Also, in the discussion, wherever we refer to a sub-vector of a vector (Figure 4), it corresponds to any contiguous part of a sorted indirection vector having the same element value.

4.1. Target-independent optimizations

This section introduces target-independent optimizations such as: (1) basic compiler optimizations to avoid unnecessary and redundant computations, (2) various data restructuring methods for inducing a potential regularity in the irregular accessed data; also contributing to make further optimizations valid and fruitful, and (3) different ways to partition computations among parallel threads to effectively exploit parallelism with low synchronization overhead.

4.1.1. Basic Compiler Optimizations:

In this sub-section, we discuss some of the standard compiler optimizations that we incorporate to obtain trivial performance improvement.

Removing redundant computation:

The dictionary matrix (DPtr) and the demeaned diffusion signal matrix (YPtr) are used in the vector format for the SpMV operations (refer to Figure 3). Therefore, to compute the actual offset of these vectors, we multiply the number of diffusion direction ( $\bf N_{\theta}$ ) with the elements of the atomsPtr and voxelsPtr indirection vectors. The original sequential CPU code computes the actual offset for every iteration of the SpMV operations of the SBBNNLS algorithm. However, we removed this redundant computation, by computing it one time before the start of SBBNNLS in the MATLAB code of LiFE. This reduced the overhead of computing the actual offsets for DPtr and YPtr in the SpMV computations.

Loop-invariant code motion:

Loop-invariant code motion optimization is utilized when a code fragment performs the same operation and computes the same output value for the different iterations of a loop, then that code fragment is hoisted out of the loop. In LiFE, the DSC operation computes the product of the weight vector (wPtr) and the values vector (valuesPtr), which remains same for the innermost loop of SpMV operations. Hence, this code fragment is hoisted out and its result is stored in a temporary variable to utilize it across the several iterations of the loop. Thus, this optimization reduced the overhead of computing the invariant-code from several times for the innermost loop to one time.

Strength reduction for arrays:

Some expressions that take more memory and CPU cycles to execute, can be compensated by an equivalent though less expensive expression. In LiFE application, the indirection vectors such as atomsPtr, voxelsPtr and fibersPtr are stored and passed as a double precision data type, and used as an index (after explicit type conversion to integer) for the DPtr, YPtr and wPtr vectors respectively. Thus, to reduce memory consumption and exploit a less expensive expression for these double precision indirection vectors, they are casted to the integer data type. This optimization is incorporated before the start of SBBNNLS in the MATLAB code and utilized across the several iterations of SBBNNLS. In addition to that, this optimization helped to cut down the data transfer overheads on GPUs due to the reduced size of the indirection vectors.

These simple and straightforward optimizations can be incorporated for both the DSC and WC operations without much effort.

4.1.2. Data Restructuring:

The LiFE algorithm is highly irregular due to the presence of multiple indirectly accessed arrays. In Figure 2, we observe that due to the STD-based representation of the matrix $\bf M$ in SpMV, three indirection vectors are involved — atomsPtr, voxelsPtr and fibersPtr, redirecting to the DPtr, YPtr and wPtr vectors respectively. These indirect array accesses procure low data reuse and prove to be a major hindrance in code parallelization as well; thus, they are a major bottleneck in optimizing the SpMV.

After analyzing the sparse datasets of LiFE, we observe that there exist several element values of an indirection vector redirecting to the same location of an indirectly accessed vector. Therefore, this is a potential source to exploit data locality. To utilize this property of the sparse datasets, we restructure the Phi tensor ( $3$ -D sparse representation of $\bf M$ , represented by $\bf\Phi$ ) data based on an indirection vector to leverage regular data access patterns. If the $\bf\Phi$ tensor is restructured based on one of the indirection vectors (for example voxelsPtr), then the other indirection vectors (such as atomsPtr and fibersPtr) are accessed irregularly. Hence, a major challenge in optimizing this irregular application is to identify a near-optimal method to restructure with low runtime overhead. Thus, to achieve high performance for an SpMV operation, we determine the data restructuring to be incorporated at runtime based on the choice of a dimension (such as atom, voxel or fiber). We now discuss different data restructuring choices coupled with their strengths and weaknesses.

Atom-based Data Restructuring:

In the atom-based data restructuring method, we sort the atomsPtr vector, and depending on that, the $\bf\Phi$ tensor is restructured by reordering the voxel, fiber, and values dimensions. This method captures data reuse for the dictionary vector DPtr in both the DSC and WC operations; but it leads to poor data reuse along the other two indirectly accessed dimensions, that is, voxel and fiber.

Voxel-based Data Restructuring:

In the voxel-based data restructuring method, we sort the voxelsPtr vector, and depending on that, the $\bf\Phi$ tensor is restructured by reordering the atom, fiber, and values dimensions. This data restructuring method captures data reuse for the demeaned diffusion signal vector YPtr in the DSC and WC operations; but it leads to poor data reuse along the other two indirectly accessed dimensions, atom and fiber.

Fiber-based Data Restructuring:

In the fiber-based data restructuring method, we reorder fibersPtr, and depending on that, the $\bf\Phi$ tensor is restructured by reordering the atom, voxel, and values dimensions. The fiber-based approach captures data reuse for the wPtr vector. However, this approach loses a chance to capture data reuse for the vectors YPtr and DPtr. By inspection we found that YPtr and DPtr vectors captures a much better regular data access pattern compared to wPtr. Thus, we skip the fiber-based data restructuring for further analysis.

Hybrid Data Restructuring:

Hybrid data restructuring technique is a merger of the atom-based and the voxel-based data restructuring methods. In this technique, we first execute the DSC and WC operations for both the atom-based and the voxel-based restructuring method three times, and based on the average execution time, we select a dimension that achieves better performance for an SpMV operation. Therefore, we obtain data reuse along the atom dimension or the voxel dimension. Then, the $\bf\Phi$ tensor is restructured again by reordering the sub-vectors of the selected dimension, to capture a chance of data reuse along the other dimension (that is, other than the selected dimension). This technique will be useful for very large datasets. However, currently for this method, the performance improvement is almost negligible due to the data access patterns of the low-resolution datasets used by us and additionally, this technique has a high overhead of an additional data restructuring. Hence, we skip the hybrid-based restructuring for further evaluation as we use only low-resolution datasets (having small memory utilization) for our evaluation.

Another advantage of data restructuring besides from that of significant improvements in data reuse due to regular accesses is that the other optimizations to exploit parallelism and reduce synchronization overheads (discussed later in this section) become valid and profitable. Therefore, data restructuring play a key role to optimize the SpMV operations of LiFE.

The data restructuring to be incorporated is dependent on the input dMRI data and other parameters (such as the number of voxels and fibers) along with a tractography algorithm used. Therefore, we automate the determination of the data restructuring at runtime, by choosing a technique having lower average execution time for three runs. We included the data restructuring optimization in the LiFE algorithm’s MATLAB implementation before invoking the SBBNNLS algorithm, so that the overhead ( $3$ - $5\%$ of the total execution time of SBBNNLS) is amortized across several iterations of the non-negative least-squared algorithm. Note that for a different architecture and an SpMV of LIFE, the data restructuring technique that obtains a near-optimal performance may vary.

4.1.3. Computation Partitioning:

Post data restructuring, the other problem in improving performance of the SpMV operations was the usage of an atomic operation, which was required due to parallel threads performing a reduction in the DSC and WC operations (Figure 3). This causes a high synchronization overhead at runtime, detrimental to the exploitation of massive parallelism on multi-cores and GPUs. We note that the communication among threads can be reduced by mapping computations of the outermost loop of SpMV to a single thread based on the coefficient ( $\bf N_{c}$ ) parameter of the LiFE, or on the atomsPtr or the voxelsPtr dimension. Thus, another major challenge in optimizing the SpMV operations is to determine a method to partition computations for effectively exploiting parallelism and further improving the data reuse for the YPtr and DPtr vectors. We discuss various approaches to handle the computations performed by each thread block in addition to their merits and demerits in detail.

Coefficient-based computation partitioning:

In the coefficient-based computation partitioning technique, a single thread handles computations of a single coefficient or in other words single non-zero value of the sparse tensor ( $\bf\Phi$ ). The parallelism provided by multi-cores and GPUs can be effectively used by the coefficient-based technique, but this leads to a loss of data reuse for the YPtr and DPtr vectors. Additionally, as stated in Section 2.2, the wPtr vector is projected to positive space, implying that the negative values are replaced by zeros. This sparse property of wPtr is particularly useful for the DSC operation as a lot of unnecessary computations can be avoided. However, this computation partitioning technique requires usage of an atomic operation due to the reduction of the YPtr and wPtr vectors in the DSC and WC operations respectively. The coefficient-based technique also hinders incorporation of certain other optimizations discussed later in this section.

Atom-based computation partitioning:

In the atom-based computation partitioning technique, computations are partitioned across the threads based on the atom dimension, where each thread handles computations of a particular atom. Therefore, this technique obtains good data reuse for DPtr but lose an opportunity to exploit data reuse for YPtr. Note that the atom-based computation partitioning uses the atom-based data restructuring.

Voxel-based computation partitioning:

In the voxel-based partitioning technique, computations are partitioned across the voxels, where each thread handles computations of one voxel. In this way, the voxel-based partitioning obtains excellent data reuse for YPtr (as it is accessed twice due to reduction) but lose an opportunity to exploit data reuse for DPtr. Note that the voxel-based computation partitioning uses the voxel-based data restructuring.

The disadvantage of using the atom-based and the voxel-based techniques are (1) all iterations associated with a sub-vector of voxel or atom dimension are executed sequentially; therefore, this leads to a loss to fully utilize the sparse property of wPtr, and (2) each thread block handles several iterations depending on the size of a sub-vector, where the size may vary from one to thousands of iterations; hence, this induces load imbalance. Therefore, due to the moderate parallelism of multi-core CPUs, the load imbalance might be more prominent in them. Thus, to tackle the load imbalance in CPUs, we propose a new technique discussed later in Section 4.2.1.2. However, on GPUs, the load imbalance issue does not impact much because the number of iterations of the outermost loop ( $\bf N_{c}$ ) in the SpMV operations is extremely large compared to the maximum possible thread blocks that can be scheduled to even the modern GPUs.

Therefore, this optimization helped in exploiting coarse-grained parallelism with excellent data reuse. We also observed that avoiding the atomic operation improves the performance considerably than taking advantage of the sparse property of wPtr. Thus, by performing experiments on the datasets used by us, we found that for DSC the coefficient-based partitioning is favourable for CPUs and the voxel-based partitioning is favourable for GPUs, whereas for WC the coefficient-based technique is favourable for both CPUs and GPUs.

4.2. Target Specific Optimizations

In this sub-section, we present target-specific optimization techniques to optimize SpMV operation of LiFE on multi-core and GPU architectures.

4.2.1. CPU-specific Optimizations:

Firstly, we discuss benefits and applicability of incorporating target-independent optimizations on CPUs. Then we introduce CPU-specific optimizations such as efficient synchronization-free thread mapping to utilize coarse-grained parallelism with reduced load imbalance and usage of BLAS library calls to exploit fine-grained parallelism.

4.2.1.1 Target-independent optimizations on CPUs:

In Section 4.1, we discussed three target-independent optimizations for SpMV operations of LiFE. The basic compiler optimizations presented are directly applicable to obtain trivial performance improvement on CPUs. The data restructuring optimization helped to enhance data reuse for YPtr and DPtr vectors in SpMV operations, and further assisted to validate parallelism. Next, we presented different ways to partition computations among the parallel threads to exploit coarse-grained parallelism. However, this optimization aggravated the issue of load imbalance for atom-based and voxel-based partitioning, and an issue of high synchronization overhead for the coefficient-based partitioning due to the usage of an atomic operation to avoid data races. It is difficult to improve the load balance for the atom-based and voxel-based partitioning methods; however, for the coefficient-based partitioning, the overhead issue can be addressed if the atomic operation is evaded. Hence, to tackle this issue we propose a CPU-specific optimization, which is discussed next in this section.

4.2.1.2 Efficient Synchronization-free Thread Mapping:

Earlier in Section 4.1.3, we discussed various ways to partition computations to the parallel threads. We concluded that for both the SpMV operations, the atom-based and the voxel-based partitioning techniques were not profitable due to the load imbalance issue. In addition to that, the atom-based and voxel-based methods required an atomic operation due to the reduction of YPtr vector in DSC operation and wPtr vector in WC operation respectively. Whereas, the coefficient-based did not have a prominent load imbalance issue but still it was not profitable due to the usage of an atomic operation.

For WC operation, we observe that for different computation partitioning techniques, the performance is influenced due to the usage of an atomic operation for the reduction of wPtr; although, based on experiments we discovered that the usage of the atomic operation did not deteriorate the performance much. We found that coefficient-based partitioning is the best choice among the other methods because it exhibits a much better load balance. However, for DSC, we observed that there was a significant drop in performance due to the usage of atomic operation (for all the partitioning methods) and the load imbalance issue (for atom and voxel based methods). Thus, using the coefficient-based partitioning method, we tackle this issue by proposing an efficient synchronization-free thread mapping technique to exploit coarse-grained parallelism without the usage of an atomic operation to improve the performance of DSC.

In Figure 5, we observe the usage of coefficient-based splitting technique for the different data restructuring methods for DSC. Figure 5a shows the atom-based restructuring technique reorders the voxelsPtr vector in such a way that there are high chances of data race at runtime; hence, this method exhibits poor performance due to requirement an atomic operation to avoid data race. Figure 5b shows that the voxel-based technique has a low chance of data dependence but cannot be eliminated completely; hence, this technique too requires an atomic operation. However, we found that there only two instances might occur for a sub-vector of the voxelsPtr vector when the voxel-based data restructuring method is employed. These instances are: (1) the entire sub-vector is scheduled to the same thread; hence, it causes no issue due to sequential execution of the iterations of the sub-vector (case 1 of Figure 5b), and (2) the sub-vector is split across the two threads (case 2 of Figure 5b); therefore, for this case an atomic operation is required due to a chance of data dependence at run-time.

To tackle this issue, we ensure that the sub-vector of the voxelsPtr vector is scheduled to the same thread with a low load imbalance. In Figure 5b, we can observe in the case 2 that the sub-vector (with $4$ value) is split across the threads $1$ and $2$ . To avoid any chance of occurrence of conflicting access, the sub-vector has to be scheduled to either of the one thread. If the sub-vector is scheduled to the thread-1 then it will compute two additional computations, whereas if the sub-vector is scheduled to the thread-2 then it will compute only one additional computation. Hence, scheduling the sub-vector to the thread-2 will help to reduce the load imbalance. The small overhead of load imbalance is a necessary trade-off considering the reduction in execution time obtained for parallel execution of the DSC operation without the usage of an atomic operation.

Thus, to exploit the coarse-grained parallelism for the DSC operation without atomic operation and with reduced load imbalance, we proposed an efficient synchronization-free thread mapping using the coefficient-based partitioning and the voxel-based data restructuring method.

4.2.1.3 Mapping to BLAS calls:

Basic linear algebra subroutines (BLAS)††Usage of BLAS calls on Intel platforms have a slightly different result on different runs of the same program due to rounding error. https://github.com/xianyi/OpenBLAS/issues/1627 packages are often hand-optimized to obtain close to peak performance on various hardware. It is thus useful to leverage these automatically in a DSL setting. We make use of optimized BLAS call in the SpMV operations of the SBBNNLS algorithm. BLAS call improved the overall performance of the LiFE algorithm significantly. We discuss usage of a BLAS call in each of the SpMV operations of SBBNNLS.

BLAS call for DSC operation: The code fragment in the innermost loop of DSC (refer to Figure 3a) corresponds to scalar-vector product. We substitute the code fragment with the daxpy BLAS call to obtain significant performance improvement. In the BLAS call, dictionary vector (DPtr) is used as an input vector and the product of a value in the weight vector (wPtr) and the values vector (valuesPtr) is used as a scalar input. The output is used to update the demeaned diffusion signal vector (YPtr).

According to the SBBNNLS stated in Algorithm 1, wPtr is projected to the positive space; hence, due to this property of wPtr the negative values are replaced by zeros. Therefore, the wPtr vector is sparse in nature. Hence, in the DSC operation, if the scalar value obtained from the product vector wPtr and vector valuesPtr is zero then invoking the BLAS call is futile and should be avoided to refrain from unnecessary computations.

BLAS call for the WC operation: The code fragment in the innermost loop of WC (refer to Figure 3b) corresponds to vector-vector dot product. We substitute the code fragment with the dot BLAS call to obtain performance improvement. In dot BLAS call, the YPtr and DPtr vectors are used to update the wPtr vector. However, in contrast to the DSC operation, the execution time remains almost the same throughout SBBNNLS.

Usage of BLAS call provided fine-grained parallelism for the SpMV operations and improved the performance considerably. Particularly, the DSC operation was greatly benefited by the usage of the BLAS call.

To summarize the optimization of SpMV on CPUs, first we performed the target-independent optimizations, followed by the CPU-specific optimizations to obtain a highly optimized CPU code for the SpMV operations of SBBNNLS. We also extended the PolyMage DSL to incorporate all the optimization presented in this section to automatically generate optimized parallelized code involving the sparse representation of the SpMV operations of SBBNNLS and obtained comparable performance to that of the manually optimized version (CPU-opt). We will discuss more on the DSL extension in Section 5. Note that some of the CPU optimizations require runtime data analysis such as the optimization presented in Section 4.2.1.2. Thus, it could not be incorporated for the automated CPU code version and as a result the automated code version could not achieve the similar performance compared to that of the hand-optimized CPU code version.

4.2.2. GPU-specific Optimizations:

Firstly, we discuss benefits and applicability of incorporating target-independent optimizations on GPUs. Then, we present various GPU-specific optimizations to optimally map threads at the granularity of warps, thread blocks and grid to obtain fine-grained parallelism and improved data reuse. We use GPU code developed by Madhav 57, shown in Figure 6, as a reference GPU code version.

4.2.2.1 Target-independent optimizations on GPUs:

In Section 4.1, we discussed a number of target-independent optimizations for SpMV operations. For GPUs, the basic compiler optimizations presented is useful to obtain minor performance. The data restructuring optimization proposed captured enhanced data reuse for YPtr and DPtr vectors, and further aided to legitimize parallelism. Following that, we presented different ways to partition computations among the parallel threads to exploit coarse-grained parallelism. However, this optimization had similar issues for a GPU that we discussed in Section 4.2.1.1 for a CPU; although, the issue of the load balance discussed earlier for a CPU is not prominent for a GPU due to its massive parallelism. Thus, we do not introduce any new optimization to tackle load imbalance issue for the GPUs and take a step forward to exploit fine-grained parallelism in the SpMV operations.

4.2.2.2 Exploiting Fine-grained Parallelism:

The reference optimized GPU approach executes the innermost loop of both the SpMV operations sequentially (Figure 6). It was performed due to the indirect array accesses of the SpMV operations and the concurrent scheduling of multiple iterations of the outermost loop to a single thread block; hence, the innermost loop had to be executed sequentially to avoid a data race. Thus, due to these reasons, the reference GPU approach missed out an important opportunity to exploit fine-grained parallelism. However, with the aid of resources and instructions provided by a GPU architecture, we could exploit fine-grained parallelism; hence, it helps in obtaining substantial performance improvement in both the SpMV operations. We discuss the different techniques to achieve fine-grained parallelism in the DSC and WC.

Shared memory: Shared memory is an on-chip explicitly addressed memory with significantly lower memory latency than local and global memories of GPUs. It is key in reducing memory access time when data accessed by the threads of a thread block exhibit reuse.

In Figure 6a, we notice in the DSC code that the innermost loop (line $15$ ) performing the daxpy operation is executed sequentially. We used shared memory to execute the iterations of the innermost loop in parallel, though with the usage of a synchronization barrier. However, later in Section 4.2.2.3, we will note that the threads can be executed without the employment of a memory fence. The added advantage of using the shared memory is reduced memory bandwidth requirements obtained due to data reuse of YPtr. Also, note that the size of shared memory required depends on the diffusion direction ( $\bf N_{\theta}$ ).

Shuffle instruction: Parallel threads of a thread block share data using shared memory. However, NVIDIA’s Kepler architecture introduced a new warp-level instruction, named, shuffle instruction (SHFL) 29, to be utilized when the data is to be shared directly among the parallel threads of a warp. It leads to a considerable reduction in latency without the use of shared memory.

In Figure 6b, we observe in the WC code that the innermost loop (line $14$ ) performing dot-product operation is executed sequentially. The dot-product involves two sub-operations — (1) multiply corresponding elements of the vectors, which can be performed in parallel, (2) perform a reduction, which is performance bottleneck if performed sequentially. A popular method to perform reductions in GPUs is to use shared memory. This method however is dependent on the size of shared memory and requires the employment of a memory fence, thereby hurting performance. An alternative method is to use the SHFL instruction 29. It helps to share data directly among the parallel threads of a warp, but requires the usage of a synchronization barrier and shared memory, across the warps of a thread block. However, later in Section 4.2.2.3, we will tackle the synchronization bottleneck as well. Using SHFL, we parallelized the dot-product to significantly reduce the execution time of WC.

Thus, in Figure 7b, we can observe that after incorporating the fine-grained parallelization, the innermost loop of an SpMV operation is executed in parallel, where each thread block handles the iterations of a single sub-vector of voxelsPtr. Note that the computations associated with an iteration of the sub-vector are executed in parallel. However, the computations across the iterations are executed sequentially, requiring the syncthread barrier in between the iterations. We tried to replace the daxpy computation in the innermost loop of the DSC code and the dot-product computation in the innermost loop of the WC code with appropriate cuBLAS library calls, but were unsuccessful due to the difficulty in interfacing this from MATLAB.

4.2.2.3 **Reduce Synchronization Overhead by using Warp-based Thread

Execution:**

On NVIDIA GPUs, a warp is a collection of a certain number of threads (typically $32$ ) executing the same code in lock-step and is best used when each thread follows the same execution path. When there are a number of warps sharing data or performing dependent pieces of computation, those pieces need to be synchronized and this could impact performance. As discussed earlier in Section 4.2.2.2, the SpMV operations of the LiFE algorithm face a similar challenge.

In Figure 7b, we observe that the iterations of the innermost loop of the SpMV operations executing in parallel require syncthread barrier across the warps of a thread block. However, by transforming the innermost loop, multiple warps can be replaced by a single warp. Note that the innermost loop parameter depends on $\bf N_{\theta}$ , which is typically a multiple of $32$ for most of the dMRI datasets ( $96$ for dMRI datasets we used). So the innermost loop is transformed such that the $32$ iterations are executed in parallel by a warp, and then the next $32$ iterations are executed in parallel by the same warp, i.e., $\bf N_{\theta}$ / $32$ times sequential execution (as shown in Figure 7c, $\bf N_{\theta}$ =96 requires three sequential executions). The advantage of this change is that we can utilize syncwarp, a much less expensive barrier operation when compared to the syncthread barrier. This will also benefit the next set of optimizations we incorporate to optimize SpMV (discussed later in this section). However, if $\bf N_{\theta}$ is not a multiple of $32$ then the zeros are padded for the YPtr and DPtr vectors to tune their dimensions to a multiple of $32$ . The overhead ( $2$ - $3$ % of the total execution time of SBBNNLS) of padding is low considering that it is amortized across the several iterations of SBBNNLS.

4.2.2.4 Exploiting Additional Data Reuse:

Earlier in Section 4.1.3, we discussed different ways to partition computations of the outermost loop of the SpMV operations among the thread blocks. We scheduled computations of a single coefficient, voxel or atom dimension to a single thread block (Figure 7a); so that the atomic operation hindering the coarse-grained parallelism could be avoided. Despite this optimization, a thread block could not fully utilize the resources allocated by a GPU (such as shared memory and cache memory). The reasons for this were: (1) the size of $\bf N_{\theta}$ is small, and (2) only one warp is scheduled per thread block because of the optimization discussed in Section 4.2.2.3 (shown in Figure 7c).

However, we found that resources allocated for a single thread block could be utilized optimally (Figure 7d) by scheduling multiple computations of coefficients, voxels or atoms could to a single thread block. Thus, this optimization would help to effectively utilize shared memory to exploit an additional data reuse for the YPtr and DPtr vectors, thereby leads to reduction of memory bandwidth consumption. Additionally, the synchronization overhead will also reduce due to the usage of the syncwarp barrier. To obtain near-optimal performance improvements on this aspect, we empirically determined the right number of computations to be scheduled for a thread block. We found that for both the DSC and WC, four computations per thread block provided the near-optimal performance.

4.2.2.5 Loop Unrolling:

Loop unrolling is straightforward and well-known to improve performance by reducing control overhead, providing more instruction scheduling freedom, and increasing register reuse. Using loop unrolling, we achieve an additional performance improvement for the DSC operation. However, a similar performance improvement was not observed for the WC operation because the loop index was static; so the compiler might have automatically unrolled the loop. We determined the unroll factor by performing a few experiments and found eight was optimal unroll factor for the DSC. We used #pragma unroll N (where N is unroll factor) to unroll the loop corresponding to the iterations of the sub-vector of an indirection vector (example voxelsPtr) in the CUDA code of the DSC of SBBNNLS.

To summarize the optimization of SpMV on GPUs, first we performed the target-independent optimizations, followed by the GPU-specific optimizations to obtain a highly optimized GPU code for the SpMV operations of SBBNNLS.

5. Domain-Specific Language Extensions

In this section, we provide a brief overview of the PolyMage DSL and a description of the constructs we added to the DSL, in order to express sparse matrices and the related operations used in the LiFE algorithm.

Bibliography106

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1)
2lic (2017) 2017. Creative Commons Attribution 4.0 license (CC BY 4.0). (2017). http://creativecommons.org/licenses/by/4.0
3Acar et al . (2007 a) Evrim Acar, Canan Aykut-Bingol, Haluk Bingol, Rasmus Bro, and Bülent Yener. 2007 a. Multiway analysis of epilepsy tensors. Bioinformatics 23, 13 (July 2007), i 10–i 18.
4Acar et al . (2007 b) Evrim Acar, Canan Aykut Bingol, Haluk Bingol, Rasmus Bro, and Bulent Yener. 2007 b. Seizure Recognition on Epilepsy Feature Tensor. In 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society .
5Acar et al . (2005) Evrim Acar, Seyit A. Çamtepe, Mukkai S. Krishnamoorthy, and Bülent Yener. 2005. Modeling and Multiway Analysis of Chatroom Tensors. In Intelligence and Security Informatics . 256–268.
6Acar et al . (2006) Evrim Acar, Seyit A. Çamtepe, and Bülent Yener. 2006. Collective Sampling and Analysis of High Order Tensors for Chatroom Communications. In Intelligence and Security Informatics . 213–224.
7Arenaz et al . (2005) Manuel Arenaz, Juan Touriño, and Ramón Doallo. 2005. An Inspector-Executor Algorithm for Irregular Assignment Parallelization. In Parallel and Distributed Processing and Applications , Jiannong Cao, Laurence T. Yang, Minyi Guo, and Francis Lau (Eds.). Berlin, Heidelberg, 4–15.
8Balay et al . (2010) S Balay, K Buschelman, Victor Eijkhout, William Gropp, Dinesh Kaushik, Matthew Knepley, L Curfman Mcinnes, B F. Smith, and Hong Zhang. 2010. PET Sc Users Manual Revision 3.1. (01 2010).