A Multilevel Approach for the Performance Analysis of Parallel Algorithms

Luisa D'Amore; Valeria Mele; Diego Romano; Giuliano Laccetti

arXiv:1901.05836·cs.DC·January 18, 2019

TL;DR

This paper introduces a multilevel framework for analyzing parallel algorithm performance by decomposing algorithms into operators and matrices to reveal parallelism and overheads.

Contribution

It presents a novel multilevel method that models parallel algorithms through operator sets and matrix representations to better understand their performance characteristics.

Findings

01

Decomposition level influences algorithm granularity.

02

Block matrices reveal inherent parallelism.

03

Analysis identifies sources of overheads.

Abstract

We provide a multilevel approach for analysing the performance of parallel algorithms. The main outcome of this approach is that the algorithm is described by a set of operators which are related to each other according to the problem decomposition. The decomposition level determines the granularity of the algorithm. A set of block matrices (decomposition and execution) highlights fundamental characteristics of the algorithm, such as inherent parallelism and sources of overhead.

Equations

$\mathcal{B}_{N_{r}}:In_{\mathcal{B}_{N_{r}}}\mapsto Out_{\mathcal{B}_{N_{r}}}$

$\mathcal{B}_{N_{r}}\equiv(N_{r},In_{\mathcal{B}_{N_{r}}},Out_{\mathcal{B}_{N_{r}}})$

$\mathcal{F}=((\mathcal{E},\pi,\pi_{\mathcal{E}}),[0,r_{D}-1]\cdot[0,c_{D}-1],f)$

$d_{i,j}\nleftrightarrow d_{i,s},\quad\forall s,j\in[0,c_{D}-1]$

$d_{i,j}\leftarrow d_{i-1,q},\quad\forall j\in[0,c_{D}-1]$

$\sum_{i=0}^{k-1}N_{i}\geq N_{r}$

$D_{k}(\mathcal{B}_{N_{r}}):=\{\mathcal{B}_{N_{0}},\ldots,\mathcal{B}_{N_{k-1}}\}$

$D_{k}(\mathcal{B}_{N_{r}})\equiv\left(\sum_{i=0}^{k-1}N_{i},In_{\mathcal{B}_{N_{r}}},Out_{\mathcal{B}_{N_{r}}}\right)$

$k_{i}=card(D_{k_{i}}(\mathcal{B}_{N_{r}}))=card(D_{k_{j}}(\mathcal{B}_{N_{q}}))=k_{j}$

$\forall\mathcal{B}_{N_{s}}\in D_{k_{i}}(\mathcal{B}_{N_{r}})\;\exists!\,\mathcal{B}_{N_{t}}\in D_{k_{j}}(\mathcal{B}_{N_{q}}):\mathcal{B}_{N_{s}}\mathscr{S}\mathcal{B}_{N_{t}}$

$D_{k_{i}}(\mathcal{B}_{N_{r}})\mathscr{S}D_{k_{j}}(\mathcal{B}_{N_{q}})$

$Cop_{\mathcal{M}_{P}}:=\{I^{j}\}_{j\in[0,q-1]}$

$\exists D_{k}(\mathcal{B}_{N_{r}})\in\mathcal{DB}_{N_{r}}:\forall\mathcal{B}_{N_{j}}\in D_{k}(\mathcal{B}_{N_{r}})\;\exists I^{j}\in Cop_{\mathcal{M}_{P}}:I^{j}[\mathcal{B}_{N_{j}}]=S(\mathcal{B}_{N_{j}})$

$\theta:\mathcal{B}_{N_{j}}\in D_{k}(\mathcal{B}_{N_{r}})\in\mathcal{DB}_{N_{r}}\longmapsto I^{j}\in Cop_{\mathcal{M}_{P}}$

$A_{D_{k}(\mathcal{B}_{N_{r}}),\mathcal{M}_{P}}=\{I^{i_{0}},I^{i_{1}},\ldots,I^{i_{k}}\}$

$I^{i_{k}}\circ I^{i_{k-1}}\circ\ldots\circ I^{i_{0}}[\mathcal{B}_{N_{r}}]=S(\mathcal{B}_{N_{r}})$

$\gamma:\mathcal{B}_{N_{\nu}}\in D_{k}(\mathcal{B}_{N_{r}})\in\mathcal{DB}_{N_{r}}\longleftrightarrow I^{i_{j}}\in A_{D_{k}(\mathcal{B}_{N_{r}}),\mathcal{M}_{P}}$

$A_{k,P}^{i}=\{I^{i_{0}},I^{i_{1}},\ldots,I^{i_{k}}\},\quad A_{k,P}^{j}=\{I^{j_{0}},I^{j_{1}},\ldots,I^{j_{k}}\}$

$\varphi:A_{k,P}\in AL\longrightarrow D_{k}(\mathcal{B}_{N_{r}})\in\mathcal{DB}_{N_{r}}$

$\varrho(A_{k,P})=\{A_{k,P}\in AL:\varphi(A_{k,P})\ldots$

$C(A_{k,P}):=card(A_{k,P})=k$

$card(A_{k,P})=card(D_{k}(\mathcal{B}_{N_{r}}))=k,\quad\forall A_{k,P}\in\varrho(A_{k,P})$

$A_{k_{i},P}\,\mathscr{S}\,A_{k_{j},P}\Longrightarrow C(A_{k_{i},P})=C(A_{k_{j},P})=k$

$r_{E}=r_{D_{k}}\quad\text{and}\quad c_{E}=P=c_{D_{k}}$

$r_{E}=n\cdot r_{D_{k}}\quad\text{and}\quad c_{E}=P=c_{D_{k}}/n$

$Sc_{up}(A_{k_{i},P},A_{k_{j},P}):=\frac{k_{i}}{k_{j}}$

$Sc_{up}(A_{k_{i},P},A_{k_{j},P})=\frac{C(A_{k_{i},P})}{C(A_{k_{j},P})}$

$\mathcal{P}^{d}(x)=a_{d}x^{d}+a_{d-1}x^{d-1}+\ldots+a_{0},\quad a_{d}\neq 0,\quad\mathcal{P}^{d}\in\Pi_{d},\quad x\in\Re$

$Sc_{up}(A_{k,P},A_{k^{\prime},P})=\xi(N_{r},\mu)\cdot\mu^{d-1}$

Full text

A Multilevel Approach for the Performance Analysis of Parallel Algorithms

L. D’Amore

V. Mele

D. Romano

G. Laccetti

University of Naples, Federico II, Naples (IT)

Institute of High Performance Computing and Networking (ICAR), CNR, Naples (IT)

Keywords: Algorithm, Performance Metrics, Parallelism.

1 Introduction and Motivation

Numerical algorithms are at the heart of the software that enables scientific discoveries. The development of effective algorithms has a tremendous impact on harnessing emerging computer architectures to achieve new science. The mapping problem, first considered in the 1980s [7], refers to implementing algorithms on a given target architecture so as to maximize some performance metrics [4, 5, 27, 28, 31]. Due to the multidimensional heterogeneity of modern architectures, it is becoming increasingly clear that using performance metrics in a one-size-fits-all approach fails to discover the sources of performance degradation that prevent an algorithm from delivering the desired performance level. We believe that a performance model based on problem-specific features, as well as on mathematical tools to better analyze and understand algorithm behavior, should be developed. The present article attempts to collect our efforts in this area.

We briefly summarize how the performance model provided in this work originates. We first address the basic structural features of algorithms, which are dictated by data and operator dependencies [32]. These dependencies are relations among computations which need to be satisfied in order to compute the problem solution correctly. The absence of dependencies indicates the possibility of parallel computations, so the study of data dependencies becomes the most critical step in parallelising the computations of an algorithm. Then, in analogy to the graph of dependencies between tasks, we introduce the algorithm as a set of operators, starting from a predetermined decomposition of the problem described by a suitably defined matrix, called the decomposition matrix. The mapping of the algorithm onto the computing machine is described by the execution matrix.

1.1 Organization of the article

Section 2 reviews basic concepts and definitions useful for setting up the mathematical framework. We define the decomposition matrix; following [32], we describe a parallel algorithm as an ordered set of operators; we define the complexity of the algorithm in terms of the number of such operators; finally, we define the execution matrix, which describes the mapping of the algorithm onto the target computing resource. Section 3 focuses on two metrics characterizing the algorithm performance, namely the scale up factor and the speed up. In Section 4 we analyse the performance of parallel algorithms arising from the same problem decomposition, and we derive the Generalized Amdahl's Law and some important upper and lower bounds of the performance metrics. In Section 5 we consider the particular case where the operators of an algorithm have the same execution time (namely, the operators are the usual floating point operations); in other words, we assume a decomposition at the lowest level of granularity and we derive the standard expressions for the performance metrics. In Section 6 conclusions are drawn.

1.2 Related works

The appropriate mapping depends upon both the specification of the algorithm and the underlying architecture. First of all, it implies a transformation of the algorithm into an equivalent but more appropriate form. Works on the mapping problem can be classified according to the representation used. Graph-based approaches perform transformations on the algorithm and the architecture represented as graphs. In this approach the algorithm is modeled in terms of graph structures and the mapping in terms of graph partitions [7]. Linear algebra approaches represent the graph and its data dependencies by a matrix, then transform the graph by performing matrix operations. Language-based approaches transform one form of program text into another form, where the target form textually incorporates information about the architecture [24]. Characteristic-based approaches represent the algorithm in terms of a set of characteristics which determines the transformations. Included in this category is the work of [29], which describes a technique that abstracts a computation in terms of its data dependencies. The method is based on a mathematical transformation of the index sets and of the data-dependency vectors associated with the given algorithm.

One common issue of the aforementioned approaches is that very often the model used to represent the algorithm cannot be explicitly employed for deriving expressions of the algorithm's performance metrics. Instead, performance analysis is often accomplished with automatic tools on a combination of the algorithm and the parallel architecture on which it is implemented (the so-called parallel system), exploiting automatic mappings, automatic translations, re-targeting mappings, tracing, and auto-tuning tools (such as the PaRSEC runtime system [10], which provides a portable way to automatically adapt algorithms to new hardware trends). Nevertheless, these approaches ignore the properties of the problem decomposition. Our model, instead, allows one to choose the level of abstraction of the problem decomposition and of the algorithm description, which determines the level of granularity of the performance analysis. A set of parameters is used both to describe the problem and to compute the speed up, efficiency, cost, overhead, scale up and operating point of the algorithm, starting from the problem decomposition. The metrics and their asymptotic estimates, which represent upper or lower bounds of the algorithm's performance, depend on parameters characterizing the structure of the two matrices, namely their numbers of rows and columns, and on computing environment parameters, such as the execution time of one floating point operation.

2 Preliminary Concepts and Definitions

We introduce a dependency relationship among the component parts of a computational problem, among the operators of the algorithm that solves the problem and, finally, among the memory accesses of the algorithm. In this way we are able to define two matrices (decomposition and execution) which highlight fundamental characteristics of the algorithm and which are the foundations of the mathematical model we are going to introduce. To this aim we first give some definitions to which we refer in this work. (It is worth noting that these definitions do not claim to be general; their aim is to establish the mathematical setting to which we will restrict our attention.)

Definition 1.

(Computational Problem) A computational problem $\mathcal{B}_{N_{r}}$ is the mathematical problem specified by an input/output function

$\mathcal{B}_{N_{r}}:In_{\mathcal{B}_{N_{r}}}\mapsto Out_{\mathcal{B}_{N_{r}}},$

where $N_{r}$ is the input data size and $r\in\mathbb{N}$, between the data and the solution of $\mathcal{B}_{N_{r}}$.

Therefore, in the following we assume that the computational problem $\mathcal{B}_{N_{r}}$ is identified by the triple

$\mathcal{B}_{N_{r}}\equiv(N_{r},In_{\mathcal{B}_{N_{r}}},Out_{\mathcal{B}_{N_{r}}}).$

Definition 2.

(Similar Computational Problem) Two computational problems, $\mathcal{B}_{N_{r}}$ and $\mathcal{B}_{N_{q}}$, are said to be similar if they are specified by the same functional relation and differ only in the input/output data size. If $\mathcal{B}_{N_{r}}$ and $\mathcal{B}_{N_{q}}$ are similar we write $\mathcal{B}_{N_{r}}\mathscr{S}\mathcal{B}_{N_{q}}$.

Dividing a computation into smaller computations, some or all of which may potentially be executed in parallel, is the key step in designing parallel algorithms. The parts that a problem is decomposed into often share input, output, or intermediate data. The dependencies usually result from the fact that the output of one part is the input for another. In our mathematical framework the relationship among the component parts of a computational problem is described by the so-called decomposition matrix. In order to define this matrix we need to introduce the following algebraic structure.

Definition 3.

(Dependency Group) Let $(\mathcal{E},\pi)$ be a group and let $\pi_{\mathcal{E}}$ be a strict partial order relation on $\mathcal{E}$ , which is compatible with $\pi$ . We say that any element of $\mathcal{E}$ , let us say $A$ , depends on an element of $\mathcal{E}$ , let us say $B$ , if $A\pi_{\mathcal{E}}B$ , and we write $A\leftarrow B$ . If $A$ and $B$ do not depend on each other we write $A\nleftarrow B$ . The group $(\mathcal{E},\pi)$ equipped with $\pi_{\mathcal{E}}$ is called dependency group and it is denoted as $(\mathcal{E},\pi,\pi_{\mathcal{E}})$ .

Remark 1.

Since $\pi_{\mathcal{E}}$ is transitive, from Definition 3 it follows that any two elements of $\mathcal{E}$, say $A$ and $B$, are independent if there is no relationship between them. In this case we write $A\nleftarrow B$ and $B\nleftarrow A$, or simply $A\nleftrightarrow B$.

Now we are able to define the dependency matrix on $(\mathcal{E},\pi,\pi_{\mathcal{E}})$ .

Definition 4.

(Dependency Matrix) Given $(\mathcal{E},\pi,\pi_{\mathcal{E}})$, the matrix $\mathcal{F}$ of size $r_{D}\cdot c_{D}$, whose elements $d_{i,j}\in(\mathcal{E},\pi)$ are such that, $\forall i\in[0,r_{D}-1]$,

$d_{i,j}\nleftrightarrow d_{i,s},\quad\forall s,j\in[0,c_{D}-1]$

and, $\forall i\in[1,r_{D}-1]$, $\exists q\in[0,c_{D}-1]$ s.t.

$d_{i,j}\leftarrow d_{i-1,q},\quad\forall j\in[0,c_{D}-1],$

while the other elements are set equal to zero, is called the dependency matrix. (For simplicity of notation, in the following we continue to define matrices in the usual sense of matrix calculus; seen as a family, the dependency matrix is defined by the triple

$\mathcal{F}=((\mathcal{E},\pi,\pi_{\mathcal{E}}),[0,r_{D}-1]\cdot[0,c_{D}-1],f)$

where $f$ is a map between $(\mathcal{E},\pi,\pi_{\mathcal{E}})$ and the set of indices.)

Remark 2.

Matrix $\mathcal{F}$ is unique (by construction), up to a permutation of the elements on the same row. Its number of columns $c_{\mathcal{F}}$ is called the concurrency degree of $(\mathcal{E},\pi,\pi_{\mathcal{E}})$ (a similar concept has already been highlighted in [21]), and $r_{\mathcal{F}}$ is called the dependency degree of $\mathcal{E}$. The concurrency degree measures the intrinsic concurrency among the sub-problems of $(\mathcal{E},\pi,\pi_{\mathcal{E}})$.
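The construction behind Definition 4 can be sketched in code. This is a minimal sketch under stated assumptions, not the authors' implementation: the dependency relation is assumed to be given as a Python dictionary (a hypothetical input format) that maps each element to the elements it directly depends on, and each element is placed one row below its deepest dependency. The function name `dependency_matrix` and the use of `None` for empty entries are illustrative choices.

```python
from collections import defaultdict


def dependency_matrix(deps):
    """Return (matrix, dependency_degree, concurrency_degree).

    deps: dict mapping each element to the set of elements it directly
    depends on (hypothetical input format, not from the paper).
    """
    level = {}

    def depth(x):
        # Row index of x: 0 if it depends on nothing, otherwise one more
        # than the deepest element it depends on.
        if x not in level:
            level[x] = 1 + max((depth(d) for d in deps.get(x, ())), default=-1)
        return level[x]

    for x in deps:
        depth(x)

    rows = defaultdict(list)
    for x, lv in level.items():
        rows[lv].append(x)

    r = max(rows) + 1                        # number of rows
    c = max(len(v) for v in rows.values())   # number of columns
    # Pad shorter rows with None, standing in for empty entries.
    matrix = [sorted(rows[i]) + [None] * (c - len(rows[i])) for i in range(r)]
    return matrix, r, c


# Example: c needs a and b, d needs c.
deps = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}
matrix, r, c = dependency_matrix(deps)
print(matrix, r, c)
```

With this layering, elements on the same row never depend on each other, and every element below the first row depends on at least one element of the previous row, which is the property the definition asks for.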

2.1 The Problem Decomposition

Let $S({\mathcal{B}_{N_{r}}})$ denote the solution of $\mathcal{B}_{N_{r}}$. (Here, for the sake of simplicity, we assume that $S(\mathcal{B}_{N_{r}})$ exists and is unique.)

Definition 5.

(Decomposition of a computational problem) Given $\mathcal{B}_{N_{r}}$, any finite set of computational problems $\{\mathcal{B}_{N_{i}}\}_{i=0,\ldots,k-1}$, where $k\in\mathbb{N}$, such that $\mathcal{B}_{N_{r}}\leftarrow\mathcal{B}_{N_{i}}$, where $N_{i}<N_{r}$, and

$\sum_{i=0}^{k-1}N_{i}\geq N_{r},$

is called a decomposition of $\mathcal{B}_{N_{r}}$. Each $\mathcal{B}_{N_{i}}$ denotes a sub-problem of $\mathcal{B}_{N_{r}}$. A decomposition of $\mathcal{B}_{N_{r}}$, which is denoted as

$D_{k}(\mathcal{B}_{N_{r}}):=\{\mathcal{B}_{N_{0}},\ldots,\mathcal{B}_{N_{k-1}}\},$

defines the computational problem

$D_{k}(\mathcal{B}_{N_{r}})\equiv\left(\sum_{i=0}^{k-1}N_{i},In_{\mathcal{B}_{N_{r}}},Out_{\mathcal{B}_{N_{r}}}\right).$

The set of all the decompositions of $\mathcal{B}_{N_{r}}$ is denoted as $\mathcal{D}\mathcal{B}_{N_{r}}$ .

Definition 6.

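(Similar Decompositions) Given $\mathcal{B}_{N_{r}}\mathscr{S}\mathcal{B}_{N_{q}}$, two decompositions $D_{k_{i}}(\mathcal{B}_{N_{r}})$ and $D_{k_{j}}(\mathcal{B}_{N_{q}})$ are called similar if

$k_{i}=card(D_{k_{i}}(\mathcal{B}_{N_{r}}))=card(D_{k_{j}}(\mathcal{B}_{N_{q}}))=k_{j}$

and

$\forall\mathcal{B}_{N_{s}}\in D_{k_{i}}(\mathcal{B}_{N_{r}})\;\exists!\,\mathcal{B}_{N_{t}}\in D_{k_{j}}(\mathcal{B}_{N_{q}}):\mathcal{B}_{N_{s}}\mathscr{S}\mathcal{B}_{N_{t}},$

and we write

$D_{k_{i}}(\mathcal{B}_{N_{r}})\mathscr{S}D_{k_{j}}(\mathcal{B}_{N_{q}}).$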

Remark 3.

(Decomposition matrix) In order to capture the interactions among the component parts (or sub-problems) of $\mathcal{B}_{N_{r}}$, we use the dependency matrix on $D_{k}(\mathcal{B}_{N_{r}})$. More precisely, by using Definition 3 we introduce the group $(D_{k}(\mathcal{B}_{N_{r}}),g_{sol})$, where $g_{sol}$ is any map between any two elements $\mathcal{B}_{N_{i}}$ and $\mathcal{B}_{N_{j}}$ of $D_{k}(\mathcal{B}_{N_{r}})$, equipped with the strict partial order relation $\pi_{D_{k}(\mathcal{B}_{N_{r}})}$. Then, we construct the (unique) dependency matrix $\mathcal{F}$ corresponding to the decomposition $D_{k}(\mathcal{B}_{N_{r}})$. In the following we denote this matrix as $M_{D}(D_{k}(\mathcal{B}_{N_{r}}))$, or $M_{D_{k}}$ for simplicity, and we refer to it as the decomposition matrix. Given $D_{k}(\mathcal{B}_{N_{r}})$, let $c_{D_{k}}$ denote the number of columns of $M_{D_{k}}$: this is the (unique) concurrency degree of $\mathcal{B}_{N_{r}}$. Let $r_{D_{k}}$ denote the number of rows: this is the (unique) dependency degree of $\mathcal{B}_{N_{r}}$. The concurrency degree measures the intrinsic concurrency among the sub-problems of $\mathcal{B}_{N_{r}}$.

We observe that, if $M_{D_{k}}$ has no empty elements, the problem $\mathcal{B}_{N_{r}}$ has the highest intrinsic concurrency; hence we give the following definition.

Definition 7.

(Perfectly Decomposed Problems) $\mathcal{B}_{N_{r}}$ is said to be perfectly decomposed if $\exists D_{k}(\mathcal{B}_{N_{r}})$ and $M_{D}$ such that

1. $c_{D}>1$;

2. $\forall\,i,j$, $d_{i,j}\neq\emptyset$.
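The two conditions of Definition 7 translate directly into a check on a matrix stored as a list of rows. The sketch below assumes empty entries are represented by `None`; the function name and the two example matrices are illustrative and not taken from the paper.

```python
def is_perfectly_decomposed(m_d):
    """True if m_d has more than one column and no empty (None) entry."""
    n_cols = len(m_d[0])
    return n_cols > 1 and all(e is not None for row in m_d for e in row)


# Four independent blocks of a block-wise vector sum: one full row.
blocks = [["b0", "b1", "b2", "b3"]]

# A reduction tree: lower rows have fewer sub-problems, hence empty entries.
tree = [["s0", "s1"], ["s01", None]]

print(is_perfectly_decomposed(blocks), is_perfectly_decomposed(tree))
```

Under this representation a decomposition whose rows shrink, as in a reduction, cannot be perfectly decomposed, because shorter rows have to be padded with empty entries.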

The next step is to take these parts and assign them (i.e., the mapping step) onto the computing machine. In the next section we introduce the computing environment characterized by the set of logical-operational operators/operations that it is able to apply/execute.

2.2 The computing architecture

We introduce the machine $\mathcal{M}_{P}$ equipped with $P\geq 1$ processing elements with specific logical-operational capabilities such as: basic operations (arithmetic, $\ldots$), special function evaluations ($\sin,\cos,\ldots$), solvers (integrals, systems of equations, nonlinear equations, $\ldots$). These are the computing operators of $\mathcal{M}_{P}$. In particular, we will use the following characterization of the operators of $\mathcal{M}_{P}$.

Definition 8.

(Computing Operators) The operator $I^{j}$ of $\mathcal{M}_{P}$ is a correspondence between $\mathbb{R}^{s}$ and $\mathbb{R}^{t}$ , where $s,\,t\in\mathbb{N}$ are positive integers.

Given $\mathcal{M}_{P}$ , the set without repetitions

$Cop_{\mathcal{M}_{P}}:=\{I^{j}\}_{j\in[0,q-1]},$

where $q\in\mathbb{N}$ , characterizes logical-operational capabilities of the machine $\mathcal{M}_{P}$ . Operators, properly organized, provide the solution to $\mathcal{B}_{N_{r}}$ , as stated in the following

Definition 9.

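(Solvable Problems) $\mathcal{B}_{N_{r}}$ is solvable in $\mathcal{M}_{P}$ if

$\exists D_{k}(\mathcal{B}_{N_{r}})\in\mathcal{DB}_{N_{r}}:\forall\mathcal{B}_{N_{j}}\in D_{k}(\mathcal{B}_{N_{r}})\;\exists I^{j}\in Cop_{\mathcal{M}_{P}}:I^{j}[\mathcal{B}_{N_{j}}]=S(\mathcal{B}_{N_{j}}),$

that is, if there exists a relation

$\theta:\mathcal{B}_{N_{j}}\in D_{k}(\mathcal{B}_{N_{r}})\in\mathcal{DB}_{N_{r}}\longmapsto I^{j}\in Cop_{\mathcal{M}_{P}}.$

In particular, we say that a decomposition is suited for $\mathcal{M}_{P}$ if $\theta$ is a function. From now on, we consider any problem $\mathcal{B}_{N_{r}}$ as solvable, and any decomposition $D_{k}(\mathcal{B}_{N_{r}})\in\mathcal{DB}_{N_{r}}$ suited for $\mathcal{M}_{P}$ as fixed. (Note that there is no loss of generality.)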

We associate an execution time $t_{i}$ (measured, for instance, in seconds) to each $I^{i}\in Cop_{\mathcal{M}_{P}}$ in $\mathcal{M}_{P}$. If $I^{i}\equiv\varnothing$, we set $t_{\varnothing}=0$.

2.3 The Algorithm

In the literature, an algorithm is any procedure consisting of a finite number of unambiguous rules that specify a finite sequence of operations to reach a solution to a problem or a specific class of problems [23]. Here we define an algorithm as a proper set of operators which solves $\mathcal{B}_{N_{r}}$, as stated in the following definition.

Definition 10.

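(Algorithm) Given $D_{k}(\mathcal{B}_{N_{r}})$, an algorithm solving $\mathcal{B}_{N_{r}}$, indicated as

$A_{D_{k}(\mathcal{B}_{N_{r}}),\mathcal{M}_{P}}=\{I^{i_{0}},I^{i_{1}},\ldots,I^{i_{k}}\},$

is a sequence of elements (not necessarily distinct) of $Cop_{\mathcal{M}_{P}}$ such that

$I^{i_{k}}\circ I^{i_{k-1}}\circ\ldots\circ I^{i_{0}}[\mathcal{B}_{N_{r}}]=S(\mathcal{B}_{N_{r}}),$

where the symbol $\circ$ denotes the composition of correspondences and $j\in[0,card(Cop_{\mathcal{M}_{P}})-1]$, and such that there is a bijective correspondence

$\gamma:\mathcal{B}_{N_{\nu}}\in D_{k}(\mathcal{B}_{N_{r}})\in\mathcal{DB}_{N_{r}}\longleftrightarrow I^{i_{j}}\in A_{D_{k}(\mathcal{B}_{N_{r}}),\mathcal{M}_{P}}.\qquad(5)$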

Every ordered subset of $A_{D_{k}(\mathcal{B}_{N_{r}}),\mathcal{M}_{P}}$ is a sub-algorithm of $A_{D_{k}(\mathcal{B}_{N_{r}}),\mathcal{M}_{P}}$ .

For simplicity of notation, and when there is no ambiguity, we indicate algorithms briefly as $A_{k,P}$.
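Definition 10 treats an algorithm as an ordered sequence of operators whose composition yields the solution. The following minimal sketch illustrates only the composition step. It assumes each operator is modelled as a plain Python function acting on the whole problem state, which is a simplification, since in the text each operator is tied to one sub-problem through $\gamma$. The names `run_algorithm` and `pairwise_sum` are hypothetical.

```python
from functools import reduce


def run_algorithm(operators, data):
    """Apply the operators in order: operators[0] is applied first."""
    return reduce(lambda state, op: op(state), operators, data)


def pairwise_sum(xs):
    """Hypothetical operator: add adjacent pairs of a list."""
    return [sum(xs[i:i + 2]) for i in range(0, len(xs), 2)]


# Three applications of the same operator reduce 8 numbers to their sum.
algorithm = [pairwise_sum, pairwise_sum, pairwise_sum]
result = run_algorithm(algorithm, [1, 2, 3, 4, 5, 6, 7, 8])
print(result)
```

Listing the operators in application order mirrors the composition $I^{i_{k}}\circ\ldots\circ I^{i_{0}}$, where the rightmost operator acts first.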

Definition 11.

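(Equal Algorithms) Two algorithms

$A_{k,P}^{i}=\{I^{i_{0}},I^{i_{1}},\ldots,I^{i_{k}}\},\quad A_{k,P}^{j}=\{I^{j_{0}},I^{j_{1}},\ldots,I^{j_{k}}\}$

are said to be equal if $\forall s\in[0,k]$, $I^{i_{s}}\equiv I^{j_{s}}$.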

Note that two equal algorithms have the same cardinality.

Definition 12.

(Granularity set of an Algorithm) Given $A_{k,P}$, the subset $\mathcal{G}(A_{k,P})$ of $A_{k,P}$ made of the distinct operators of $A_{k,P}$ defines the granularity set of $A_{k,P}$. Two algorithms

$A_{k,P}^{i}=\{I^{i_{0}},I^{i_{1}},\ldots,I^{i_{k}}\},\quad A_{k,P}^{j}=\{I^{j_{0}},I^{j_{1}},\ldots,I^{j_{k}}\}$

have the same granularity if $\mathcal{G}(A_{k,P}^{i})\equiv\mathcal{G}(A_{k,P}^{j})$.
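As a small illustration of Definition 12, the granularity set can be computed as the set of distinct operators in a sequence. The sketch assumes operators are identified by string labels, which is a hypothetical encoding chosen only for this example.

```python
def granularity_set(algorithm):
    """Distinct operators of an algorithm given as a list of labels."""
    return frozenset(algorithm)


# Two operator sequences of different length built from the same operators.
a_i = ["add", "add", "mul", "add"]
a_j = ["mul", "add"]

same_granularity = granularity_set(a_i) == granularity_set(a_j)
print(sorted(granularity_set(a_i)), same_granularity)
```

Two algorithms can therefore share the same granularity while containing different numbers of operators.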

Let $AL_{\mathcal{B}_{N_{r}}}$ (or simply $AL$ ) be the set of algorithms that solve $\mathcal{B}_{N_{r}}$ , obtained by varying $\mathcal{M}_{P}$ , the number of processing units $P$ and $D_{k}(\mathcal{B}_{N_{r}})\in\mathcal{DB}_{N_{r}}$ . Even if one can easily formulate infinite variations of an algorithm that do the same thing, in the following we assume $AL$ to be finite.

Definition 13.

(The quotient set $\frac{AL}{\varrho}$) Let

$\varphi:A_{k,P}\in AL\longrightarrow D_{k}(\mathcal{B}_{N_{r}})\in\mathcal{DB}_{N_{r}}\qquad(6)$

be the surjective correspondence which induces on $AL$ an equivalence relationship $\varrho$ of $AL$ in itself, such that

[TABLE]

The set $\varrho(A_{k,P})$ consists of the algorithms of $AL$ associated with the same decomposition $D_{k}(\mathcal{B}_{N_{r}})\in\mathcal{DB}_{N_{r}}$. $\varrho$ induces the quotient set $\frac{AL}{\varrho}$, whose elements are disjoint and finite subsets of $AL$ determined by $\varrho$, that is, they are the equivalence classes under $\varrho$.

In the following we assume $A_{k,P}$ to represent its equivalence class in $AL$ .

Definition 14.

(Complexity) The cardinality of $A_{k,P}$, denoted as $C(A_{k,P})$, is called the complexity of $A_{k,P}$. It is

$C(A_{k,P}):=card(A_{k,P})=k.$

Remark 4.

$C(A_{k,P})=k$ equals the number of non-empty elements of $M_{D_{k}}$, i.e. the decomposition matrix defined on $D_{k}(\mathcal{B}_{N_{r}})$. By virtue of the bijective correspondence $\gamma$ in (5), it holds that

$card(A_{k,P})=card(D_{k}(\mathcal{B}_{N_{r}}))=k,\quad\forall A_{k,P}\in\varrho(A_{k,P}).\qquad(8)$

So, all algorithms belonging to the same equivalence class under $\varrho$ have the same complexity. An integer (the complexity) is therefore associated with each element $\varrho(A_{k,P})$ of the quotient set $\frac{AL}{\varrho}$, and this induces an ordering relation between the equivalence classes in $\frac{AL}{\varrho}$; therefore there is a minimum complexity for the algorithms that solve the problem $\mathcal{B}_{N_{r}}$.

Remark 5.

(Similar Algorithms) Given $\mathcal{B}_{N_{r}}\mathscr{S}\mathcal{B}_{N_{q}}$ and their similar decompositions $D^{\prime}_{k_{i}}(\mathcal{B}_{N_{r}})\mathscr{S}D^{\prime\prime}_{k_{j}}(\mathcal{B}_{N_{q}})$ with $k_{i},k_{j}\in\mathbb{N}$ and $k_{i}=k_{j}=k$ (see Definition 6), the algorithms belonging to $\varrho(A_{k_{i},P})=\varphi^{-1}(D^{\prime}_{k_{i}}(\mathcal{B}_{N_{r}}))$ (see (6)) are similar to the algorithms belonging to $\varrho(A_{k_{j},P})=\varphi^{-1}(D^{\prime\prime}_{k_{j}}(\mathcal{B}_{N_{q}}))$. From Definitions 6 and 14 and from (8), it follows that

$A_{k_{i},P}\,\mathscr{S}\,A_{k_{j},P}\Longrightarrow C(A_{k_{i},P})=C(A_{k_{j},P})=k,$

that is, similar algorithms have the same complexity.

Remark 6.

Since, according to $\gamma$, we can associate an operator $I^{i_{k}}\in A_{k,P}$ to each subproblem, the operators of $A_{k,P}$ inherit the dependencies existing between the subproblems of $\mathcal{B}_{N_{r}}$, but they do not inherit the independencies because, for instance, two operators may depend on the availability of the computing units of $\mathcal{M}_{P}$ during their execution [32].

Remark 7.

(Execution matrix) According to Definition 3, we introduce the group $\left(\mathcal{P}\left(A_{k,P}\right),\circ,\pi_{A_{k,P}}\right)$, where $\mathcal{P}\left(A_{k,P}\right)$ is the set of all the sub-algorithms of $A_{k,P}$, and $\pi_{A_{k,P}}$ is the strict partial order relation between any two elements of $\mathcal{P}\left(A_{k,P}\right)$ which guarantees that two elements cannot be performed in arbitrary order and simultaneously. (The condition that two elements cannot be performed in arbitrary order induces the inheritance of dependencies between decomposition subproblems and algorithm operators, while the condition that two elements cannot be performed simultaneously, which relates to the availability of resources, adds possible reasons for dependency between operators, which depend on the machine on which algorithm $A$ is intended to run [32].) We construct the matrix $\mathcal{F}$ of order $r_{E}\cdot c_{E}$, where $c_{E}=P$, as a dependency matrix (see Definition 4). (In general $c_{E}\leq P$, but we can exclude the cases where the dependencies existing between subproblems do not allow all the available computing units to be used, i.e. where $c_{E}<P$, because they can easily be reduced to the case where $c_{E}=P$.) The number of columns of this matrix represents the maximum number of sub-algorithms that can be performed simultaneously on $\mathcal{M}_{P}$. In the following, we call this matrix the execution matrix and we refer to it by the symbol $M_{E}(A_{k,P})=(e_{i,j})$, or simply $M_{E_{k,P}}$ if there is no ambiguity. Matrix $M_{E_{k,P}}$ is unique up to a permutation of the elements on the same row. This matrix is analogous to the execution graphs (see [6, 9, 11, 30]) that are often used to describe the sequence of steps of an algorithm on a given machine for a particular input or a particular configuration.

Remark 8.

Since $card(A_{k,P})=card(D_{k}(\mathcal{B}_{N_{r}}))$, the matrices $M_{D_{k}}$ and $M_{E_{k,P}}$ have the same number of non-empty elements ($k$), for any $P\geq 1$. If $c_{E}=P=c_{D_{k}}$, there exists $A_{k,P}$ whose matrix $M_{E_{k,P}}$ has exactly the same structure as the matrix $M_{D_{k}}$.
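One simple way to obtain an execution matrix from a decomposition matrix is sketched below. It assumes that the only extra constraint introduced by the machine is the number $P$ of computing units, so each row of $M_{D_{k}}$ is cut into chunks of at most $P$ entries; a real machine may add further dependencies. The function name `execution_matrix` and the `None` convention for empty entries are illustrative.

```python
def execution_matrix(m_d, p):
    """Map a decomposition matrix onto p computing units.

    Each row of m_d is split into consecutive chunks of at most p
    non-empty entries; every chunk becomes one row of the result,
    padded with None up to p columns.
    """
    m_e = []
    for row in m_d:
        ops = [e for e in row if e is not None]
        for start in range(0, len(ops), p):
            chunk = ops[start:start + p]
            m_e.append(chunk + [None] * (p - len(chunk)))
    return m_e


def count_non_empty(m):
    return sum(e is not None for row in m for e in row)


m_d = [["a", "b", "c", "d"],
       ["e", "f", None, None],
       ["g", None, None, None]]

m_e2 = execution_matrix(m_d, 2)   # two computing units
m_e4 = execution_matrix(m_d, 4)   # as many units as columns of m_d
print(m_e2)
```

In this example both execution matrices keep the seven non-empty entries of the decomposition matrix, and with $P=c_{D_{k}}$ the structure of the decomposition matrix is reproduced unchanged, in line with Remark 8.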

Definition 15.

$A_{k,P}$ is said to be perfectly parallel if:

1. $c_{E}>1$;

2. $\forall\,i,j\,\,\,e_{i,j}\neq\emptyset$.

$A_{k,P}$ is said to be sequential if:

1. $c_{E}=1$;

2. $\nexists\,j>1\,:\,e_{i,j}\neq\emptyset$.

$A_{k,P}$ is said to be (simply) parallel if:

1. $c_{E}>1$;

2. $\exists\,i,j\,:\,e_{i,j}=\emptyset$.

Moreover,

1. every row of matrix $M_{E_{k,P}}$ such that $\exists\,e_{i,j}\neq\emptyset$, where $j>1$, is a parallel sub-algorithm of $A_{k,P}$;

2. every row of matrix $M_{E_{k,P}}$ such that $\exists\,!\,e_{i,j}\neq\emptyset$ is a sequential sub-algorithm of $A_{k,P}$.
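The three cases of Definition 15 can be told apart by inspecting the execution matrix. The sketch below assumes the matrix is stored as a list of equal-length rows with `None` for empty entries; the names `classify` and `sequential_rows` are illustrative.

```python
def classify(m_e):
    """Classify an algorithm from its execution matrix."""
    c_e = len(m_e[0])
    if c_e == 1:
        return "sequential"
    if all(e is not None for row in m_e for e in row):
        return "perfectly parallel"
    return "parallel"


def sequential_rows(m_e):
    """Count the rows holding exactly one non-empty entry."""
    return sum(1 for row in m_e
               if sum(e is not None for e in row) == 1)


full = [["a", "b"], ["c", "d"]]
partial = [["a", "b"], ["c", None]]
single = [["a"], ["b"]]

print(classify(full), classify(partial), classify(single))
print(sequential_rows(partial))
```

Counting the rows with a single non-empty entry gives the number of sequential sub-algorithms contained in a parallel algorithm.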

Remark 9.

Observe that the concurrency degree of $\mathcal{B}_{N_{r}}$ in a given decomposition provides an upper limit to the maximum number of independent sub-algorithms executable simultaneously on the machine. The dependency degree provides a lower limit to the execution time of the algorithm.

Finally, from correspondence $\gamma$ (see (5)), we say that $\mathcal{B}_{N_{r}}$ is solvable in $\mathcal{M}_{P}\Leftrightarrow\exists\,D_{k}(\mathcal{B}_{N_{r}})\in\mathcal{DB}_{N_{r}}\;:\,\exists\,A_{k,P}$ that solves $\mathcal{B}_{N_{r}}$ .

Theorem 16.

If $\mathcal{B}_{N_{r}}$ is perfectly decomposed according to $D_{k}$, then $\exists\,\mathcal{M}_{P}$, with $P>1$, such that there exists a perfectly parallel $A_{k,P}$ that solves $\mathcal{B}_{N_{r}}$.

Proof.

If $\mathcal{B}_{N_{r}}$ is perfectly decomposed, then the matrix $M_{D_{k}}$ has no empty elements and has $c_{D_{k}}>1$ columns. Since $card(A_{k,P})=card(D_{k}(\mathcal{B}_{N_{r}}))=k$, there exists $A_{k,P}$ with execution matrix $M_{E_{k,P}}$ of order $r_{E}\cdot c_{E}$, with only non-empty elements, such that

$r_{E}=r_{D_{k}}\quad\text{and}\quad c_{E}=P=c_{D_{k}}$

or

$r_{E}=n\cdot r_{D_{k}}\quad\text{and}\quad c_{E}=P=c_{D_{k}}/n,$

where the integer $n$ is such that $n<c_{D_{k}}$ and $c_{D_{k}}\bmod n=0$. (If the concurrency degree $c_{D_{k}}$ is so large that we cannot imagine a real machine with so many units, we can always use a number of computing units $P=c_{D_{k}}/n$ with $c_{D_{k}}\bmod n=0$. This means that the execution matrix of $A_{k,P}$ has $n$ times more rows and $n$ times fewer columns than the dependency matrix.)

In conclusion,

1. $M_{E_{k,P}}$ has $c_{E}=P>1$ columns,

2. no row has an empty element;

so $A_{k,P}$ is perfectly parallel. ∎
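The second case of the proof can be made concrete with a small worked instance. The values $r_{D_{k}}=2$, $c_{D_{k}}=4$ and $n=2$ below are chosen purely for illustration and do not come from the paper; the entries $e_{i,j}$ stand for the operators associated with the sub-problems $d_{i,j}$.

```latex
% Illustrative values (assumed): r_D = 2, c_D = 4, n = 2, hence P = c_D / n = 2.
M_{D_{k}} =
\begin{pmatrix}
d_{0,0} & d_{0,1} & d_{0,2} & d_{0,3} \\
d_{1,0} & d_{1,1} & d_{1,2} & d_{1,3}
\end{pmatrix},
\qquad
M_{E_{k,2}} =
\begin{pmatrix}
e_{0,0} & e_{0,1} \\
e_{1,0} & e_{1,1} \\
e_{2,0} & e_{2,1} \\
e_{3,0} & e_{3,1}
\end{pmatrix}
% r_E = n \cdot r_D = 4 and c_E = P = 2; both matrices have k = 8 non-empty entries.
```

Every row of $M_{E_{k,2}}$ is full and $c_{E}=2>1$, so the corresponding algorithm is perfectly parallel on two computing units.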

3 Algorithm Performance Metrics

In this section we employ the mathematical setting introduced in Section 2 in order to define two quantities that measure the performance of an algorithm: the scale up and the speed up.

3.1 Scale Up

Let us consider two decompositions $D_{k_{i}}(\mathcal{B}_{N})$ and $D_{k_{j}}(\mathcal{B}_{N})$ in $\mathcal{DB}_{N}$ . Let us consider $A_{k_{i},P}$ and $A_{k_{j},P}$ representing their equivalence class in $AL$ . In order to measure the scalability of parallel algorithms we introduce the following quantity

Definition 17.

(Scale up factor) If $A_{k_{i},P}$ and $A_{k_{j},P}$ have the same granularity set (see Definition 12), the ratio

$Sc_{up}(A_{k_{i},P},A_{k_{j},P}):=\frac{k_{i}}{k_{j}}$

is called the scale up factor of $\varrho(A_{k_{j},P})$ measured with respect to $\varrho(A_{k_{i},P})$.

From Definition 14, it follows that

$Sc_{up}(A_{k_{i},P},A_{k_{j},P})=\frac{C(A_{k_{i},P})}{C(A_{k_{j},P})}.$

The next proposition quantifies the scale up when we solve the same problem with an algorithm that is the concatenation of several algorithms which are similar to the first one, with polynomial complexity of degree $d$.

Proposition 18.

Given $\mathcal{B}_{N_{r}}$ , $D_{k}(\mathcal{B}_{N_{r}})$ and $D_{k^{\prime}}(\mathcal{B}_{N_{r}})=\{D_{k^{\prime}_{i}}(\mathcal{B}_{N_{q}})\}_{i=1,\mu}$ where

$N_{q}=N_{r}/\mu$ * with $\mu\in N$ , $\mu\leq N_{r}$ , and $N_{r}\bmod\mu=0$ ,*

2.

$\mathcal{B}_{N_{q}}\mathscr{S}\mathcal{B}_{N_{r}}$ ,

3.

$D_{k}\mathscr{S}D_{k^{\prime}_{i}}\mathscr{S}D_{k^{\prime}_{j}}$ , $\forall i\neq j$ .

Consider $A_{k,P}\in\varphi^{-1}(D_{k}(\mathcal{B}_{N_{r}}))$ and $A_{k^{\prime}_{i},P}\in\varphi^{-1}(D_{k^{\prime}_{i}}(\mathcal{B}_{N_{q}}))$ and assume that

1. $C(A_{k,P})=k=\mathcal{P}^{d}(N_{r})$,

2. $C(A_{k^{\prime}_{i},P})=k^{\prime}_{i}=\mathcal{P}^{d}(N_{q})$,

where

[TABLE]

then

[TABLE]

where

[TABLE]

Proof.

We have that

[TABLE]

then, from (10), it follows that

[TABLE]

that is

[TABLE]

Since $N_{q}=N_{r}/\mu$, we have

[TABLE]

and the thesis follows from (11). ∎

Corollary 19.

If $N_{r}$ is fixed and $\mu\simeq N_{r}$, then $\xi(N_{r},\mu)=\mathrm{const}$ with $\mathrm{const}\in(0,1]$, and $Sc_{up}(A_{k,P},A_{k^{\prime},P})\leq N_{r}^{d-1}$. If $\mu$ is fixed, then

[TABLE]

and

[TABLE]

If $a_{i}=0$ $\forall i<d$, then $\xi(N_{r},\mu)=1$ and $Sc_{up}(A_{k,P},A_{k^{\prime},P})=\mu^{d-1}$, $\forall\mu$.
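The last statement of Corollary 19 can be checked numerically. The Python sketch below assumes that the scale up factor of Proposition 18 is the ratio between the operator count of $A_{k,P}$ and the total operator count of the $\mu$ concatenated algorithms; the function names and the coefficient values are illustrative and do not come from the paper.

```python
from fractions import Fraction


def poly(coeffs, n):
    """Evaluate sum_i coeffs[i] * n**i exactly; coeffs[-1] is the leading a_d."""
    return sum(Fraction(a) * Fraction(n) ** i for i, a in enumerate(coeffs))


def complexity_ratio(coeffs, n_r, mu):
    """Ratio C(A_k) / (mu * C(A_k'_i)) for mu sub-problems of size N_r / mu.

    Assumption: this ratio plays the role of the scale up factor.
    """
    if mu < 1 or n_r % mu != 0:
        raise ValueError("mu must be a positive divisor of N_r")
    return poly(coeffs, n_r) / (mu * poly(coeffs, n_r // mu))


# Hypothetical cubic costs (d = 3), N_r = 64 split into mu = 4 sub-problems.
ratio_pure = complexity_ratio([0, 0, 0, 2], n_r=64, mu=4)   # only a_d is non-zero
ratio_mixed = complexity_ratio([5, 3, 1, 2], n_r=64, mu=4)  # positive lower-order terms
```

With only the leading coefficient the ratio equals $\mu^{d-1}=16$; with positive lower-order coefficients it stays below that value, which is consistent with $\xi(N_{r},\mu)\leq 1$.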

3.2 Speed Up

Let $tcalc$ be the execution time of one floating point operation.

Remark 10.

In the following, when we refer to the execution time of the computing operators of $A_{k,P}$, we write the parameters as $\beta^{calc}_{\ldots,M_{E_{k,P}}}$, which highlights the execution matrix $M_{E_{k,P}}$ characterizing the mapping of the algorithm onto the machine $\mathcal{M}_{P}$.

We assume that

[TABLE]

Definition 20.

(Row execution time) The quantity

[TABLE]

is called the execution time of row $r$ of $M_{E_{k,P}}$ (which is a sub-algorithm of $A_{k,P}$).

Remark 11.

Let $\beta_{r,M_{E_{k,P}}}^{calc}:=\max_{j\in[0,c_{E}-1]}\beta^{calc}_{r_{j},M_{E_{k,P}}}$ then

[TABLE]

Note that $\beta_{i_{j},M_{E_{k,1}}}^{calc}\geq 1$, hence $\beta_{r,M_{E_{k,1}}}^{calc}\geq 1$.

Definition 21.

(Execution time) The quantity

[TABLE]

is called the execution time of $A_{k,P}$.

Remark 12.

Let $\beta_{M_{E_{k,P}}}^{calc}:=\sum_{r=0}^{r_{E}-1}\beta^{calc}_{r,M_{E_{k,P}}}$ then $\beta_{M_{E_{k,P}}}^{calc}\geq r_{E}$ .

[TABLE]

Remark 13.

Let

[TABLE]

Then, if $P=1$, we have $\beta_{M_{E_{k,P}}}^{calc}:=\beta_{sum,M_{E_{k,P}}}^{calc}$.

Remark 14.

Let

1. $r_{seq}\leq r_{E}$ denote the number of rows of $M_{E_{k,P}}$ with only one non-empty element (sequential sub-algorithms of $A_{k,P}$);

2. $r_{par}=r_{E}-r_{seq}$, with $r_{par}\leq r_{E}$, denote the number of rows of $M_{E_{k,P}}$ with more than one non-empty element.

The sequence $i=0,\ldots,r_{E}-1$ numbering the $r_{E}$ rows of $M_{E_{k,P}}$ splits into two subsequences of indices, $\{i_{q}\}_{q\in[0,r_{seq}-1]}$ and $\{i_{r}\}_{r\in[0,r_{par}-1]}$, which leads to the following definitions.

Definition 22.

(Parallel Execution time) The quantity

[TABLE]

is called the parallel execution time of $A_{k,P}$.

Definition 23.

(Sequential Execution time) The quantity

[TABLE]

is called the sequential execution time of $A_{k,P}$.

Equation (18) can be written as

[TABLE]

This shows that, by looking at the matrix $M_{E_{k,P}}$, the model expresses the size of the parallel and of the sequential parts composing the execution time of $A_{k,P}$.
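The row-by-row splitting of the execution time can be sketched in Python as follows. The sketch assumes that the execution time of a row is $\beta^{calc}_{r,M_{E_{k,P}}}\cdot tcalc$, with $\beta^{calc}_{r,M_{E_{k,P}}}$ the maximum defined in Remark 11, and that the total time is the sum over the rows; the matrix below is a hypothetical example in which each entry is an operator time in units of $tcalc$ and `None` marks an empty element.

```python
def row_beta(row):
    """Largest operator time on a row (operators on the same row run concurrently)."""
    return max(b for b in row if b is not None)


def execution_times(m_e, tcalc):
    """Return (sequential time, parallel time, total time) of an execution matrix."""
    t_seq = 0.0
    t_par = 0.0
    for row in m_e:
        non_empty = sum(b is not None for b in row)
        t_row = row_beta(row) * tcalc
        if non_empty == 1:
            t_seq += t_row   # row with a single operator: sequential sub-algorithm
        else:
            t_par += t_row   # row with several operators: parallel sub-algorithm
    return t_seq, t_par, t_seq + t_par


# Hypothetical execution matrix on P = 3 computing units.
m_e = [
    [2, 1, 1],
    [1, None, None],
    [3, 3, None],
]
t_seq, t_par, t_total = execution_times(m_e, tcalc=1.0)
```

For this matrix the second row is the only sequential one, so the sequential part is $1\cdot tcalc$ and the parallel part is $(2+3)\cdot tcalc$.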

Let

[TABLE]

$R^{calc}$ is the parameter of the algorithm $A_{k,P}$ that depends on the most computationally intensive sub-algorithms of $A_{k,P}$.

It holds

[TABLE]

Remark 15.

If $P=1$, since $r_{E}=C(A_{k,1})=k$, from (24) it follows that

[TABLE]

Corollary 24.

From (24) it follows that

[TABLE]

and it assumes its minimum value when $r_{E}=r_{D}$ .

[TABLE]

Definition 25.

(Speed up in $\frac{AL}{\rho}$) Given $\mathcal{B}_{N_{r}}$, two different decompositions $D_{k}(\mathcal{B}_{N_{r}})$ and $D_{k^{\prime}}(\mathcal{B}_{N_{r}})$, and

1. $A_{k,P}\in\varphi^{-1}(D_{k}(\mathcal{B}_{N_{r}}))$, where $P>1$,

2. $A_{k^{\prime},1}\in\varphi^{-1}(D_{k^{\prime}}(\mathcal{B}_{N_{r}}))$,

where $\mathcal{M}_{1}$ and $\mathcal{M}_{P}$ differ only on the number of processing elements, if $\mathcal{G}(A_{k,P})=\mathcal{G}(A_{k^{\prime},P})$ , then the speed up of $A_{k,P}$ with respect to $A_{k^{\prime},1}$ is

[TABLE]

Remark 16.

(Ideal Speed up) Note that $\beta_{M_{E}(A_{k,P})}^{calc}$ is the sum of the maximum operator time on each row, so $\beta_{sum,M_{E}(A_{k,P})}^{calc}$ can equal $P\cdot\beta_{M_{E}(A_{k,P})}^{calc}$ only if all the operators have the same execution time. Since it always holds that

[TABLE]

then it holds that

[TABLE]

Definition 26.

(Speed up in $\rho(A_{k,P})$ ) The speed up of $A_{k,P}$ with respect to $A_{k,1}$ is

[TABLE]
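The bound discussed in Remark 16 can be illustrated with a short Python sketch. It assumes that the speed up of $A_{k,P}$ with respect to $A_{k,1}$ is the ratio of their execution times, that the sequential algorithm executes the same operators one at a time (so its time is $\beta_{sum}^{calc}\cdot tcalc$), and that the parallel time is $\beta_{M_{E}}^{calc}\cdot tcalc$. The two matrices are hypothetical.

```python
def beta_max_sum(m_e):
    """Sum over the rows of the largest operator time on each row."""
    return sum(max(b for b in row if b is not None) for row in m_e)


def beta_total(m_e):
    """Sum of all operator times, i.e. the time on a single computing unit."""
    return sum(b for row in m_e for b in row if b is not None)


def speed_up(m_e):
    """Ratio of sequential to parallel time; the common factor tcalc cancels."""
    return beta_total(m_e) / beta_max_sum(m_e)


# P = 4 computing units; entries are operator times in units of tcalc.
balanced = [[1, 1, 1, 1], [1, 1, 1, 1]]      # all operators take the same time
unbalanced = [[4, 1, 1, 1], [1, 1, 1, 1]]    # one operator dominates its row

sp_balanced = speed_up(balanced)
sp_unbalanced = speed_up(unbalanced)
```

The balanced matrix reaches the value $P=4$, while a single slow operator in the first row is enough to bring the ratio down to $11/5$.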

4 Algorithms which are in the same equivalence class

We consider algorithms that are in the same equivalence class, i.e. those corresponding to the same decomposition of the problem

Theorem 27.

$\forall\,\mathcal{B}_{N_{r}}$ perfectly decomposed according to the decomposition $D_{k}(\mathcal{B}_{N_{r}})$, and $\forall\,A_{k,P}$ perfectly parallel algorithm that solves it on $\mathcal{M}_{P}$ with $P>1$, if

[TABLE]

it follows that:

[TABLE]

Proof.

If $A_{k,P}$ is perfectly parallel, then $M_{E_{k,P}}$ has no empty elements so

[TABLE]

Therefore, from (25) and (27), it follows that

[TABLE]

∎

Theorem 28.

For all the matrices $M_{E_{k,P}}$ of algorithms in $\varrho(A_{k,P})$ , it holds

[TABLE]

and

[TABLE]

Moreover, let us consider $A_{k,P}^{i}$ and $A_{k,P}^{j}$ two algorithms belonging to $\varrho(A_{k,P})$ , and their matrices $M_{E_{k,P}}^{i}$ and $M_{E_{k,P}}^{j}$ . We have:

1. $c_{E}^{i}<c_{E}^{j}\Rightarrow r_{E}^{i}\geq r_{E}^{j}$;

2. $c_{E}^{i}>c_{E}^{j}\Rightarrow r_{E}^{i}\leq r_{E}^{j}$.

Proof.

Since $A_{k,P}$ inherits the dependencies defined on $D_{k}(\mathcal{B}_{N_{r}})$, it is not possible that $c_{E}>c_{D_{k}}$, therefore $c_{E}\leq c_{D_{k}}$. Moreover, there is at least one row of $M_{D_{k}}$ with $c_{D_{k}}$ non-empty elements. Let $d$ be the difference between $c_{D_{k}}$ and $c_{E}$. Since $M_{D_{k}}$ and $M_{E}$ have the same number of non-empty elements, it follows that $r_{E}\geq r_{D}+\lceil d/c_{E}\rceil$.

Similarly, it can be proved that if $c^{i}_{E}<c^{j}_{E}$ then $r^{i}_{E}\geq r^{j}_{E}$ , and if $c^{i}_{E}>c^{j}_{E}$ then $r^{i}_{E}\leq r^{j}_{E}$ . ∎
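The trade-off between columns and rows stated in Theorem 28 can be explored with a small Python sketch. It assumes one particular (not necessarily optimal) mapping: every row of the dependency matrix with $m$ operators is split into $\lceil m/c_{E}\rceil$ consecutive rows of the execution matrix, which respects the dependencies between rows. The row counts used below are a hypothetical dependency matrix with $r_{D}=4$ and $c_{D}=8$.

```python
def rows_needed(dep_row_counts, c_e):
    """Number of execution-matrix rows when each dependency row with m
    operators is split into ceil(m / c_e) rows of at most c_e operators."""
    if c_e < 1:
        raise ValueError("c_e must be at least 1")
    return sum((m + c_e - 1) // c_e for m in dep_row_counts)


# Non-empty elements on each row of a hypothetical dependency matrix.
dep_row_counts = [8, 4, 2, 1]
r_d = len(dep_row_counts)
c_d = max(dep_row_counts)

# r_E for a decreasing number of columns c_E = P.
r_e_by_width = {c_e: rows_needed(dep_row_counts, c_e) for c_e in (8, 4, 2, 1)}

# Lower bound from the proof: r_D + ceil((c_D - c_E) / c_E).
bound_by_width = {c_e: r_d + (c_d - c_e + c_e - 1) // c_e for c_e in (8, 4, 2, 1)}
```

Here fewer columns always give at least as many rows, and every value of $r_{E}$ obtained by this mapping satisfies the lower bound used in the proof.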

Remark 17.

The minimum execution time is proportional to the dependency degree of $\mathcal{B}_{N_{r}}$, and it is attained when the number of computing units equals the concurrency degree of $\mathcal{B}_{N_{r}}$.

We now define a subset of the equivalence class of $\varrho(A_{k,P})$ . Let $\simeq$ be the equivalence relation identifying two algorithms with the same $P$ . Then

[TABLE]

i.e. the set consisting of the representatives of the equivalence classes of $\simeq$ (footnote 11: for example, we can take the algorithm in $\hat{\varrho}(A_{k,P})$, $P\geq 1$, whose execution matrix has the fewest rows).

Let us now consider matrices $M_{E_{k,P}}$ associated to algorithms belonging to $\hat{\varrho}(A_{k,P})$ , varying $P$ .

The following result defines the speed up of a parallel algorithm with respect to the sequential algorithm belonging to its class.

Theorem 29.

Consider $A_{k,1}\buildrel\varrho\over{\equiv}A_{k,P}$ with

[TABLE]

It holds

[TABLE]

Proof.

From (25), (26) and (32), it follows that

[TABLE]

∎

Corollary 30.

Since $(r_{E_{P}}\cdot c_{E_{P}})\geq C(A_{k,P})$, from (39) it follows that

[TABLE]

Definition 31.

(Ideal Speed up in $\hat{\varrho}(A_{k,P})$ ) We let

[TABLE]

be the ideal speed up.

Let $r_{par_{i}}$ denote the number of rows having $i>1$ non-empty elements, and let $r_{par_{1}}=r_{seq}$. Then

[TABLE]

Definition 32.

(Total time of $A$ with $i$ non-empty elements) Let $T_{j_{i}}$ be the execution time of a row with $i\geq 1$ non-empty elements. The quantity

[TABLE]

is the execution time of the part of $A$ with $i$ non empty elements on each row.

Remark 18.

It holds that

$r_{par}=r_{E_{P}}-r_{seq}=\sum_{i=2}^{P}r_{par_{i}}$ and $T_{par_{1}}(A_{k,P})=T_{seq}(A_{k,P}).$

The next result shows how the generalized Amdahl’s Law can be derived by using the rows of the execution matrix $M_{E_{k,P}}$ having at least one non-empty element.

Theorem 33.

(Generalized Amdahl’s Law) It holds that

[TABLE]

where

[TABLE]

Proof.

From (39) it is

[TABLE]

Dividing by $C(A_{k,P})$, it follows that

[TABLE]

that is

[TABLE]

∎

Amdahl’s Law [2] then follows as a particular case of the previous theorem.

Corollary 34.

(Amdahl’s Law) If we assume that $M_{E_{k,P}}$ only has rows with $1$ element or $P$ elements, we have

[TABLE]

where

[TABLE]

Proof.

From (42) it follows that

[TABLE]

where

[TABLE]

and

[TABLE]

If every row with more than one non-empty element has $P$ elements, then

[TABLE]

therefore, if we let $\alpha_{1}=\alpha=\frac{r_{seq}}{C(A_{k,P})}$ we get

[TABLE]

∎
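The reduction from the generalized law to the classical one can be checked numerically. The Python sketch below works under the simplifying assumption that all operators have the same execution time, so that the factors $R^{calc}$ drop out, and it takes $\alpha_{i}$ to be the fraction of operators lying on rows with $i$ non-empty elements (for $i=1$ this agrees with $\alpha=r_{seq}/C(A_{k,P})$ used in the proof). The function names and the row counts are illustrative.

```python
def generalized_speed_up(rows_by_width):
    """rows_by_width[i] = number of rows with exactly i non-empty elements."""
    total_ops = sum(i * r for i, r in rows_by_width.items())  # operator count C
    # alpha_i: fraction of operators on rows of width i (assumed definition).
    alphas = {i: i * r / total_ops for i, r in rows_by_width.items()}
    return 1.0 / sum(alpha / i for i, alpha in alphas.items())


def amdahl_speed_up(alpha, p):
    """Classical Amdahl's Law with sequential fraction alpha on p units."""
    return 1.0 / (alpha + (1.0 - alpha) / p)


# Hypothetical execution matrix on P = 4 units:
# 2 rows with a single operator and 3 full rows with 4 operators each.
p = 4
rows_by_width = {1: 2, p: 3}
total_ops = 2 * 1 + 3 * p          # C = 14 operators
r_e = 2 + 3                        # 5 rows

sp_general = generalized_speed_up(rows_by_width)
sp_amdahl = amdahl_speed_up(alpha=2 / total_ops, p=p)
```

Both expressions give $C(A_{k,P})/r_{E}=14/5$, as expected when every parallel row is full.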

Let $Q$ denote the cost of $A_{k,P}$. The cost is defined as the product of the execution time and the number of processors utilized [17]. In this mathematical setting, the cost $Q$ can be written as

[TABLE]

If $c_{E}=1$, from (27) it holds that

[TABLE]

The overhead of $A_{k,P}$ is the total time spent by all the processing elements over and above that spent in useful computation.

Definition 35.

(Algorithm Overhead) The quantity

[TABLE]

is called the overhead of $A_{k,P}$.

Theorem 36.

It holds

[TABLE]

Proof.

It holds

[TABLE]

Moreover,

[TABLE]

therefore it follows from (51)

[TABLE]

and (52) follows. ∎

Definition 37.

(Ideal Overhead in $\hat{\varrho}(A_{k,P})$)

From (52) it follows that

[TABLE]

Let $Ef(A_{k,P}):=\frac{Sp(A_{k,P})}{P}$ be the efficiency of $A_{k,P}$, where $P\geq 1$.

Theorem 38.

Let $N^{E}_{P}=c_{E_{P}}\cdot r_{E_{P}}$ denote the dimension of the execution matrix of $A_{k,P}$. It holds that

[TABLE]

Proof.

Since $c_{E}=P$ , it follows that

[TABLE]

∎

Definition 39.

(Ideal Efficiency in $\hat{\varrho}(A_{k,P})$)

Since $Sp(A_{k,P})\leq P\cdot\frac{R^{calc}(A_{k,1})}{R^{calc}(A_{k,P})}$, it always holds that $Ef(A_{k,P})\leq\frac{R^{calc}(A_{k,1})}{R^{calc}(A_{k,P})}$. So let

[TABLE]

be the ideal efficiency of $A_{k,P}$ .

Remark 19.

It is worth noting the role of the parameters $R^{calc}(A_{k,P})$ and $R^{calc}(A_{k,1})$ in (46), (54) and (55). If $A_{k,P}$ contains a few operators that are much more time consuming than the others, and $k\gg r_{E}$, then $\beta_{M_{E_{k,P}}}^{calc}\simeq\beta_{sum,M_{E_{k,1}}}^{calc}$ and $R^{calc}(A_{k,P})\gg R^{calc}(A_{k,1})$. The more operators there are, the greater the difference in (54) and the lower the ratio in (46) and (55). Hence, the greater the overhead, the lower the speed up and the efficiency. This is the consequence of a poorly balanced problem decomposition associated with $A_{k,P}$.

Let us now suppose that the algorithm $A_{k,P}$ is perfectly parallel, that is, its execution matrix $M_{E_{P}}$ has no empty elements. Since $r_{E_{P}}\cdot c_{E_{P}}=C(A_{k,P})$, it follows from (40) that

[TABLE]

from (52) that

[TABLE]

from (57)

[TABLE]

Remark 20.

If $P=c_{D}$, then $r_{E}=r_{D}$ and $c_{E}=c_{D}$, and the following results hold:

1. $Q(A_{k,P})=c_{D}\cdot r_{D}\cdot R^{calc}(A_{k,P})\cdot tcalc=N_{D}\cdot R^{calc}(A_{k,P})\cdot tcalc$;

2. $Sp(A_{k,P})=\frac{C(A_{k,P})}{r_{D}}\frac{R^{calc}(A_{k,1})}{R^{calc}(A_{k,P})}$;

3. $Oh(A_{k,P})=(c_{D}\cdot r_{D}-C(A_{k,P}))\cdot R^{calc}(A_{k,P})\cdot tcalc$;

4. $Ef(A_{k,P})=\frac{C(A_{k,P})}{r_{D}\cdot c_{D}}\frac{R^{calc}(A_{k,1})}{R^{calc}(A_{k,P})}$.

5 Algorithms with operators having the same execution time

We assume that all the operators of the algorithm have the same execution time; for example, they may be elementary floating point operations. The execution time is $\beta^{calc}\cdot tcalc$, and without loss of generality we assume that $\beta^{calc}=1$. Hence, it follows that

[TABLE]

Finally, from (24) it follows that

[TABLE]

Hence, we get

1. $Sp(A_{k,P},A_{k^{\prime},1}):=\frac{k^{\prime}}{k}\cdot\frac{k}{r_{E}}$,

2. if $Q=1$, then $Sp(A_{k,P}):=\frac{k}{r_{E}}$,

3. $Sp_{Ideal}(A_{k,P},A_{k^{\prime},1})=Sc_{up}(A_{k,P},A_{k^{\prime},1})\cdot P=\frac{k^{\prime}}{k}\cdot P$,

4. $Sp_{Ideal}(A_{k,P})=c_{E_{P}}=P$.

Finally, if $\mathcal{B}_{N_{r}}$ is perfectly decomposed then

[TABLE]

i.e. $A_{k,P}$ has the ideal speed up in the classical definition.

Let us now consider matrices $M_{E_{k,P}}$ associated with algorithms in $\hat{\varrho}(A_{k,P})$ , varying $P$ . The following results hold

1. $Q(A_{k,P})=c_{E}\cdot r_{E}\cdot tcalc$ and, if $c_{E}=1$, then $Q(A_{k,1})=k\cdot tcalc$;

2. $Oh_{Ideal}(A_{k,P})=0$;

3. $Ef_{Ideal}(A_{k,P})=1$.

Finally, the next result relates the overhead to the sparsity degree of the execution matrix.

Theorem 40.

Suppose that

[TABLE]

Given $A_{k,P}$, $P>1$, and $M_{E_{k,P}}$ of order $N^{E}_{P}=r_{E}\cdot P$, let $V_{r}$ be the number of empty elements of row $r$ of $M_{E_{k,P}}$. Then

[TABLE]

Proof.

It holds that

[TABLE]

then from (51)

[TABLE]

∎

Remark 21.

Note that $\sum_{r=0}^{r_{E}-1}V_{r}$ is the sparsity degree of the execution matrix.
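The link between overhead and sparsity can be made concrete with a short Python sketch. It works under the assumption of this section (every operator takes $tcalc$) and takes the overhead to be the cost $c_{E}\cdot r_{E}\cdot tcalc$ minus the time $k\cdot tcalc$ spent in useful computation; the matrix and the dictionary keys are illustrative.

```python
def performance_summary(m_e, tcalc=1.0):
    """Cost, overhead, speed up and efficiency of an execution matrix whose
    operators all take tcalc; None marks an empty element."""
    r_e = len(m_e)                     # number of rows
    p = len(m_e[0])                    # number of columns, c_E = P
    ops = sum(x is not None for row in m_e for x in row)  # operator count k
    empties = r_e * p - ops            # sparsity degree: sum over rows of V_r
    cost = p * r_e * tcalc             # cost of the parallel algorithm
    overhead = cost - ops * tcalc      # time not spent in useful computation
    speed_up = ops / r_e
    return {
        "empties": empties,
        "cost": cost,
        "overhead": overhead,
        "speed_up": speed_up,
        "efficiency": speed_up / p,
    }


# Hypothetical execution matrix on P = 4 units ("op" marks an operator).
m_e = [
    ["op", "op", "op", "op"],
    ["op", "op", None, None],
    ["op", None, None, None],
]
summary = performance_summary(m_e, tcalc=1.0)
```

In this example the five empty elements account for the whole overhead, $5\cdot tcalc$, and the efficiency is $7/12$; a matrix without empty elements would give zero overhead and efficiency $1$.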

The following table collects the expressions of the quantities that we have derived and that characterize the mathematical framework.

Among the decomposition approaches, recursive decomposition is the most suitable for our performance model, especially for real-world algorithms. In this case, as described in the example below, a problem is solved by first decomposing it into a set of independent sub-problems. Each of these sub-problems is in turn solved by applying a similar decomposition into smaller sub-problems, followed by a combination of their results, and so on. In this way we get a decomposition matrix whose elements can be further decomposed until they reach the level of detail considered most suitable for the subsequent analysis.

Example: Let $\mathcal{B}_{16}$ denote the computational problem of the sum of $16$ real numbers and $D_{3}(\mathcal{B}_{16})=\{\mathcal{B}_{8},\mathcal{B}_{8},\mathcal{B}_{2}\}\in\mathcal{D}\mathcal{B}_{16}.$ The decomposition matrix is

[TABLE]

If $\mathcal{B}_{8}$ can be decomposed as $D^{1}_{3}(\mathcal{B}_{8})=\{\mathcal{B}_{4},\mathcal{B}_{4},\mathcal{B}_{2}\}\in\mathcal{D}\mathcal{B}_{8}$ then

[TABLE]

In the same way, if $\mathcal{B}_{4}$ can be decomposed as $D^{2}_{3}(\mathcal{B}_{4})=\{\mathcal{B}_{2},\mathcal{B}_{2},\mathcal{B}_{2}\}\in\mathcal{D}\mathcal{B}_{4}$, then

[TABLE]

We have three decompositions for $\mathcal{B}_{16}$ :

[TABLE]

with the following characteristics, according to the corresponding decomposition matrices:

1. $D_{3}$: cardinality $3$, concurrency degree $2$ and dependence degree $2$,

2. $D_{7}$: cardinality $7$, concurrency degree $4$ and dependence degree $3$,

3. $D_{15}$: cardinality $15$, concurrency degree $8$ and dependence degree $4$,

meaning that the intrinsic concurrency of a problem heavily depends on the decomposition chosen for that problem. Each decomposition has a level of detail depending on the type of subproblems that are considered.
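The three sets of characteristics listed above can be reproduced with a small Python sketch of the recursive decomposition. It assumes that the cardinality is the number of non-empty elements of the decomposition matrix, the concurrency degree is its number of columns and the dependence degree is its number of rows; the function names are illustrative, and each entry stores the size of a sub-problem, with `None` for an empty element.

```python
def sum_decomposition_matrix(n, depth):
    """Decomposition matrix for the sum of n numbers, refined `depth` times.

    Row 0 holds 2**depth independent partial sums of n / 2**depth numbers;
    every later row combines pairs of results of the previous row (problems
    of size 2) and is padded with None up to the width of row 0.
    """
    width = 2 ** depth
    if n % width != 0 or n // width < 2:
        raise ValueError("n must split into 2**depth blocks of size >= 2")
    rows = [[n // width] * width]
    count = width // 2
    while count >= 1:
        rows.append([2] * count + [None] * (width - count))
        count //= 2
    return rows


def characteristics(matrix):
    """Return (cardinality, concurrency degree, dependence degree)."""
    cardinality = sum(x is not None for row in matrix for x in row)
    concurrency = max(sum(x is not None for x in row) for row in matrix)
    dependence = len(matrix)
    return cardinality, concurrency, dependence


# Three levels of detail for the sum of 16 numbers.
levels = [characteristics(sum_decomposition_matrix(16, depth)) for depth in (1, 2, 3)]
```

Each further level of refinement doubles the concurrency degree and adds one row of dependencies, which matches the progression from $D_{3}$ to $D_{15}$.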

[TABLE]

6 Conclusion

Recent activities of major chip manufacturers show more evidence than ever that future designs of microprocessors and large systems will be heterogeneous in nature, relying on the integration of two major types of components. On the one hand, multi/many-core CPU technology has been developed, and the number of cores will continue to escalate because of the need to pack more and more components on a chip. On the other hand, special purpose hardware and accelerators, especially Graphics Processing Units, are in commodity production. Finally, reconfigurable architectures such as Field Programmable Gate Arrays offer several parameters, such as operating frequency, precision, amount of memory and number of computation units. These parameters define a large design space that must be explored to find efficient solutions [26]. To cope with this scenario, the performance analysis of parallel algorithms should be re-evaluated to find the best-practice algorithm on novel architectures [3, 16, 19, 20, 28, 33].

In this paper we presented a mathematical framework that can be used to obtain a multilevel description of a parallel algorithm, and we showed that it is suitable for analysing the mapping of the algorithm on a given machine. The model allows the choice of a level of abstraction for the problem decomposition and for the algorithm, which determines the level of granularity of the performance analysis. This feature can be very useful for analysing the mapping of the algorithm on novel architectures. We have assumed abstract models for both the algorithms and the architectures and made numerous simplifying assumptions. However, we believe that a simplified parameterized model gives a useful generalization for better understanding algorithms that can run really fast no matter how complicated the underlying computer architecture is [15].

Bibliography

[1] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, HPCToolkit: tools for performance analysis of optimized parallel programs, Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 685-701, 2010.
[2] G. M. Amdahl, Validity of the single-processor approach to achieving large scale computing capabilities, in AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20), AFIPS Press, Reston, Va., pp. 483-485, 1967.
[3] G. Ballard, J. Demmel, O. Holtz, O. Schwartz, Minimizing communication in numerical linear algebra, SIAM Journal on Matrix Analysis and Applications, vol. 32, no. 3, pp. 866-901, 2011.
[4] F. Berman, L. Snyder, Mapping parallel algorithms into parallel architectures, Journal of Parallel and Distributed Computing, vol. 4, no. 5, pp. 439-458, 1987.
[5] F. Berman, The mapping problem in parallel computation, in Mathematical Aspects of Scientific Software, J. R. Rice (Ed.), IMA Volumes in Mathematics and its Applications, vol. 14, Springer-Verlag, 1988.
[6] A. J. Bernstein, Analysis of programs for parallel processing, IEEE Transactions on Electronic Computers, vol. EC-15, no. 5, pp. 757-763, 1966.
[7] S. H. Bokhari, On the mapping problem, IEEE Transactions on Computers, vol. 30, no. 31, pp. 207-214, 1981.
[8] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, A portable programming interface for performance evaluation on modern processors, Int. J. High Perform. Comput. Appl., vol. 14, no. 3, pp. 189-204, 2000.
