A Multilevel Approach for the Performance Analysis of Parallel Algorithms
Luisa D'Amore, Valeria Mele, Diego Romano, Giuliano Laccetti

TL;DR
This paper introduces a multilevel framework for analyzing parallel algorithm performance by decomposing algorithms into operators and matrices to reveal parallelism and overheads.
Contribution
It presents a novel multilevel method that models parallel algorithms through operator sets and matrix representations to better understand their performance characteristics.
Findings
Decomposition level influences algorithm granularity.
Block matrices reveal inherent parallelism.
Analysis identifies sources of overheads.
Abstract
We provide a multilevel approach for analysing the performance of parallel algorithms. The main outcome of this approach is that the algorithm is described by a set of operators which are related to each other according to the problem decomposition. The decomposition level determines the granularity of the algorithm. A set of block matrices (decomposition and execution) highlights fundamental characteristics of the algorithm, such as inherent parallelism and sources of overhead.
L. D’Amore
V. Mele
D. Romano
G. Laccetti
University of Naples, Federico II, Naples (IT)
Institute of High Performance Computing and Networking (ICAR), CNR, Naples (IT)
Keywords: Algorithm, Performance Metrics, Parallelism.
1 Introduction and Motivation
Numerical algorithms are at the heart of the software that enables scientific discoveries. The development of effective algorithms has a tremendous impact on harnessing emerging computer architectures to achieve new science. The mapping problem, first considered in the 1980s [7], refers to implementing algorithms on a given target architecture so as to maximize some performance metrics [4, 5, 27, 28, 31]. Due to the multidimensional heterogeneity of modern architectures, it is becoming increasingly clear that using performance metrics in a one-size-fits-all approach fails to uncover the sources of performance degradation that prevent an algorithm from delivering the desired performance level. We believe that a performance model should be developed that is based on problem-specific features, as well as on mathematical tools to better analyze and understand algorithm behavior. The present article collects our efforts in this area.
We briefly summarize how the performance model presented in this work originates. We first address the basic structural features of algorithms, which are dictated by data and operator dependencies [32]. These dependencies are relations among computations which need to be satisfied in order to compute the problem solution correctly. The absence of dependencies indicates the possibility of parallel computations, so the study of data dependencies is the most critical step in parallelising the computations of an algorithm. Then, in analogy to the graph of dependencies between tasks, we introduce the algorithm as a set of operators, starting from a predetermined decomposition of the problem described by a suitably defined matrix, called the decomposition matrix. The mapping of the algorithm onto the computing machine is described by the execution matrix.
1.1 Organization of the article
Section 2 reviews basic concepts and definitions useful for setting up the mathematical framework. We define the decomposition matrix; following [32], we describe a parallel algorithm as an ordered set of operators, and we define the complexity of the algorithm in terms of the number of such operators; finally, we define the execution matrix, which describes the mapping of the algorithm onto the target computing resource. Section 3 focuses on two metrics characterizing the algorithm performance, namely the scale up factor and the speed up. In Section 4 we analyse the performance of parallel algorithms arising from the same problem decomposition. We derive the Generalized Amdahl’s Law and some important upper and lower bounds of the performance metrics. In Section 5 we consider the particular case where the operators of an algorithm have the same execution time (namely, the operators are the usual floating point operations); in other words, we assume a decomposition at the lowest level of granularity and we derive the standard expressions for the performance metrics. In Section 6 conclusions are drawn.
1.2 Related works
The appropriate mapping depends upon both the specification of the algorithm and the underlying architecture. First, it implies a transformation of the algorithm into an equivalent but more appropriate form. Works on the mapping problem can be classified according to the representation used. Graph based approaches perform transformations on the algorithm and the architecture represented as graphs: the algorithm is modeled in terms of graph structures and the mapping in terms of graph partitions [7]. Linear algebra approaches represent the graph and its data dependencies by a matrix, then transform the graph by performing matrix operations. Language based approaches transform one form of program text into another form, where the target form textually incorporates information about the architecture [24]. Characteristic based approaches represent the algorithm in terms of a set of characteristics which determines the transformations. Included in this category is the work of [29], which describes a technique that abstracts a computation in terms of its data dependencies. The method is based on a mathematical transformation of the index sets and of the data-dependency vectors associated with the given algorithm.
One common issue of the aforementioned approaches is that very often the model used to represent the algorithm cannot be explicitly employed for deriving expressions of the algorithm’s performance metrics. Instead, performance analysis is often carried out with automatic tools on a combination of the algorithm and the parallel architecture on which it is implemented (the so-called parallel system), exploiting automatic mappings, automatic translations, re-targeting mappings tracing, and auto-tuning tools (such as the PaRSEC runtime system [10], which provides a portable way to automatically adapt algorithms to new hardware trends). Nevertheless, these approaches ignore the properties of the problem decomposition. Our model, by contrast, allows one to choose a level of abstraction of the problem decomposition and of the algorithm description, which determines the level of granularity of the performance analysis. A set of parameters is used both to describe the problem and to compute speed up, efficiency, cost, overhead, scale up and operating point of the algorithm, starting from the problem decomposition. The metrics and their asymptotic estimates, which represent upper or lower bounds of the algorithm’s performance, depend on parameters characterizing the structure of the two matrices, namely their number of rows and columns, and on computing environment parameters, such as the execution time for one floating point operation.
2 Preliminary Concepts and Definitions
We introduce a dependency relationship among component parts of a computational problem, among operators of the algorithm that solves the problem and, finally, among memory accesses of the algorithm. In this way we are able to define two matrices (decomposition and execution) which highlight fundamental characteristics of the algorithm and which are the foundations of the mathematical model we are going to introduce. To this aim we first give some definitions to which we refer in this work. (Footnote 1: It is worth noting that these definitions do not claim to be general. Their aim is to establish the mathematical setting to which we restrict our attention.)
Definition 1.
(Computational Problem) A computational problem is the mathematical problem specified by an input/output function:
[TABLE]
where is the input data size and , between the data of and the solution of .
Therefore, in the following we assume that the computational problem is identified by the triple:
[TABLE]
Definition 2.
(Similar Computational Problem) Two computational problems, and , are said to be similar if they are specified by the same functional relation and they only differ in the input/output data size. If and are similar we write .
Dividing a computation into smaller computations, some or all of which may potentially be executed in parallel, is the key step in designing parallel algorithms. The parts that a problem is decomposed into often share input, output, or intermediate data. The dependencies usually result from the fact that the output of one part is the input for another. In our mathematical framework the relationship among component parts of a computational problem will be described by the so-called decomposition matrix. In order to define this matrix we need to introduce the following algebraic structure.
Definition 3.
(Dependency Group) Let be a group and let be a strict partial order relation on , which is compatible with . We say that any element of , let us say , depends on an element of , let us say , if , and we write . If and do not depend on each other we write . The group equipped with is called dependency group and it is denoted as .
Remark 1.
Since is transitive, from Definition 3 it follows that any two elements of , let us say and , are independent if there is no relationship between them. In this case we write and , or even .
Now we are able to define the dependency matrix on .
Definition 4.
(Dependency Matrix) Given , the matrix , of size , whose elements , are such that
[TABLE]
and s.t.
[TABLE]
while the other elements are set equal to zero, is said to be the dependency matrix.
(Footnote 2: For simplicity of notation, in the following we will continue to define matrices in the usual sense of matrix calculation; seen as a family, the dependency matrix is defined by the triple: where is an application between and the set of indices.)
Remark 2.
Matrix is unique (through its construction), up to a permutation of elements on the same row. is said to be the concurrency degree of and is said to be the dependency degree of . (Footnote 3: A similar concept has already been highlighted in [21].) The concurrency degree measures the intrinsic concurrency among sub-problems of . It is obtained as the number of columns of .
2.1 The Problem Decomposition
Let denote the solution of . (Footnote 4: Here, for the sake of simplicity, we assume that exists and is unique.)
Definition 5.
(Decomposition of a computational problem) Given , any finite set of computational problems , where , such that , where , and
[TABLE]
is called a decomposition of . denotes a sub-problem of . A decomposition of , which is denoted as
[TABLE]
defines the computational problem
[TABLE]
The set of all the decompositions of is denoted as .
Definition 6.
(Similar Decompositions) Given , two decompositions and are called similar if
[TABLE]
and
[TABLE]
and we write
[TABLE]
Remark 3.
(Decomposition matrix) In order to capture interactions among component parts (or sub-problems) of , we use the dependency matrix on . More precisely, by using Definition 3 we introduce the group where is any application between any two elements and of , equipped with the strict partial order relation . Then, we construct the (unique) dependency matrix corresponding to the decomposition . In the following we denote this matrix as , or for simplicity, and we refer to it as the decomposition matrix. Given , let denote the number of columns. This is the (unique) concurrency degree of . Let denote the number of rows. This is the (unique) dependency degree of . The concurrency degree measures the intrinsic concurrency among sub-problems of .
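As an illustration of this remark, the following minimal sketch builds a dependency matrix level by level from a table of dependencies and reads off the two degrees. The function name `dependency_matrix`, the dictionary encoding of the dependencies and the use of `None` for empty elements are our assumptions for illustration, not notation from the paper; the level-by-level construction is our reading of Definition 4.

```python
from collections import defaultdict


def dependency_matrix(deps):
    """Build a dependency matrix as a list of rows.

    deps maps every sub-problem name to the set of sub-problems it
    depends on (assumed acyclic). Row 0 holds the sub-problems with
    no dependencies; row i holds those whose longest dependency chain
    has length i. Short rows are padded with None (empty elements).
    """
    level = {}

    def depth(x):
        if x not in level:
            level[x] = 1 + max((depth(y) for y in deps[x]), default=-1)
        return level[x]

    rows = defaultdict(list)
    for x in deps:
        rows[depth(x)].append(x)

    n_cols = max(len(r) for r in rows.values())
    return [rows[i] + [None] * (n_cols - len(rows[i])) for i in range(len(rows))]


# Hypothetical example: s1 and s2 are independent, s3 needs both.
deps = {"s1": set(), "s2": set(), "s3": {"s1", "s2"}}
D = dependency_matrix(deps)
concurrency_degree = len(D[0])   # number of columns
dependency_degree = len(D)       # number of rows
```

In this small case the matrix has two columns and two rows, with one empty element on the second row.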
We observe that, if there are no empty elements, the problem has the highest intrinsic concurrency, hence we give the following definition.
Definition 7.
(Perfectly Decomposed Problems) is said perfectly decomposed if and such that
* ;*
- 2.
, .
The next step is to take these parts and assign them (i.e., the mapping step) onto the computing machine. In the next section we introduce the computing environment characterized by the set of logical-operational operators/operations that it is able to apply/execute.
2.2 The computing architecture
We introduce the machine equipped with processing elements with specific logical-operational capabilities such as: basic operations (arithmetic,), special functions evaluations (), solvers (integrals, equations system, non linear equations). These are the computing operators of . In particular, we will use the following characterization of operators of .
Definition 8.
(Computing Operators) The operator of is a correspondence between and , where are positive integers.
Given , the set without repetitions
[TABLE]
where , characterizes logical-operational capabilities of the machine . Operators, properly organized, provide the solution to , as stated in the following
Definition 9.
(Solvable Problems) is solvable in if
[TABLE]
that is, if there exists a relation
[TABLE]
In particular, we say that a decomposition is suited for if is a function. From now on, we consider as solvable any problem , and as fixed any decompositions suited for . (Footnote 5: Note that there is no loss of generality.)
We associate execution time (measured, for instance, in seconds) to each in . If , we set .
2.3 The Algorithm
In the literature, an algorithm is any procedure consisting of a finite number of unambiguous rules that specify a finite sequence of operations to reach a solution to a problem or a specific class of problems [23]. Here we define an algorithm as a proper set of operators which solves , as stated in the following
Definition 10.
(Algorithm) Given , an algorithm solving , indicated as
[TABLE]
is a sequence of elements (not necessarily distinct) of , such that (Footnote 6: In the following we use the symbol to denote correspondence composition.)
[TABLE]
where , and such that there is a bijective correspondence
[TABLE]
Every ordered subset of is a sub-algorithm of .
For simplicity of notations and when there is no ambiguity, we indicate algorithms briefly as .
Definition 11.
(Equal Algorithms) Two algorithms
[TABLE]
are said equal if
Note that two equal algorithms have the same cardinality.
Definition 12.
(Granularity set of an Algorithm) Given , the subset of made of distinct operators of defines the granularity set of . Two algorithms
[TABLE]
have the same granularity if .
Let (or simply ) be the set of algorithms that solve , obtained by varying , the number of processing units and . Even if one can easily formulate infinitely many variations of an algorithm that do the same thing, in the following we assume to be finite.
Definition 13.
(The quotient set ) Let
[TABLE]
be the surjective correspondence which induces on an equivalence relationship of in itself, such that
[TABLE]
The set consists of algorithms of associated with the same decomposition . induces the quotient set , whose elements are disjoint and finite subsets of determined by , that is they are equivalence classes under .
In the following we assume to represent its equivalence class in .
Definition 14.
(Complexity) The cardinality of , denoted as , is said to be the complexity of . It is
[TABLE]
Remark 4.
equals the number of non empty elements of , i.e. the decomposition matrix defined on . By virtue of the bijective correspondence in (5), it holds that
[TABLE]
So, each algorithm belonging to the same equivalence class according to has the same complexity. An integer (the complexity) is therefore associated with each element of quotient set which induces an ordering relation between the equivalence classes in : therefore there is a minimum complexity for algorithms that solve the problem .
Remark 5.
(Similar Algorithms) Given and their relative similar decompositions with and (see Definition 6), algorithms belonging to (see (6)) are similar to algorithms belonging to . From Definitions 6 and 14 and (8), it follows that
[TABLE]
that is similar algorithms have the same complexity.
Remark 6.
As we can associate to each subproblem according to , then the operators of inherit the dependencies existing between subproblems of , but they do not inherit independencies, because for instance, two operators may depend on the availability of computing units of during their execution [32].
Remark 7.
(Execution matrix) According to Definition 3, we introduce the group where is the set of all the sub-algorithms of , and is the strict partial order relation between any two elements of that guarantees that two elements cannot be performed in any arbitrary order and simultaneously (see Footnote 7). We construct matrix of order , where (see Footnote 8), as a dependency matrix (see Definition 4). The number of columns of this matrix represents the maximum number of sub-algorithms that can be performed simultaneously on . In the following, we call this matrix the execution matrix and we refer to it by using the symbol or simply if there is no ambiguity. Matrix is unique up to a permutation of elements on the same row. This matrix can be placed in analogy with the execution graphs (see [6, 9, 11, 30]) that are often used to describe the sequence of steps of an algorithm on a given machine for a particular input or a particular configuration.
(Footnote 7: The condition that two elements cannot be performed in an arbitrary order induces the inheritance of dependencies between decomposition subproblems and algorithm operators, while the condition that two elements cannot be performed simultaneously, which relates to the availability of resources, adds possible reasons for dependency between operators, which depend on the machine on which the algorithm is intended to run [32].)
(Footnote 8: In general , but we can exclude cases where dependencies existing between subproblems do not allow the use of all the computing units available, i.e. in which , because they can easily be reduced to the case where .)
Remark 8.
As it is , then and have the same number of non empty elements (), whichever is . If , there exists whose matrix has exactly the same structure as the matrix .
Definition 15.
* is said perfectly parallel if:*
;
- 2.
.
* is said sequential if:*
;
- 2.
.
* is said (simply) parallel if:*
;
- 2.
.
Moreover,
1. Every row of matrix such that , where , is a parallel sub-algorithm of .
2. Every row of matrix such that is a sequential sub-algorithm of .
Remark 9.
Observe that the concurrency degree of in a given decomposition provides an upper limit to the maximum number of independent sub-algorithms executable simultaneously on the machine. The dependency degree provides a lower limit to the execution time of the algorithm.
Finally, from correspondence (see (5)), we say that is solvable in that solves .
Theorem 16.
If is perfectly decomposed according to , , where , such that perfectly parallel that solves .
Proof.
If is perfectly decomposed then the matrix has no empty elements and has order greater than . Since , there exists with execution matrix of order , with only non zero elements, such that
[TABLE]
or (Footnote 9: If the concurrency degree is so great that we cannot imagine a real machine with so many units, we can always use a number of computing units with . This means that the execution matrix of will have times more rows and times fewer columns than the dependency matrix.)
[TABLE]
with the integer is such that and .
In conclusion,
1. has columns,
2. no rows have an empty element;
so is perfectly parallel. ∎
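The case covered by the footnote in the proof, where fewer computing units are used than the concurrency degree, can be pictured as folding each row of the matrix onto the available units. A minimal sketch, assuming that the number of columns is an exact multiple of the number of units `p`; the function name `fold_rows` and the list-of-rows encoding are ours, not the paper's:

```python
def fold_rows(matrix, p):
    """Split every row of `matrix` into consecutive chunks of width p.

    Elements on the same row are independent, so the chunks of one row
    can be executed one after another on p units without violating any
    dependency between rows.
    """
    width = len(matrix[0])
    if width % p != 0:
        raise ValueError("p must divide the number of columns")
    folded = []
    for row in matrix:
        for start in range(0, width, p):
            folded.append(row[start:start + p])
    return folded


# Hypothetical perfectly decomposed problem: 2 rows, 4 columns, no empty element.
D = [["a", "b", "c", "d"],
     ["e", "f", "g", "h"]]
E = fold_rows(D, 2)   # twice as many rows, half as many columns
```

The folded matrix still has no empty element, which is the property used in the proof to conclude that the algorithm is perfectly parallel.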
3 Algorithm Performance Metrics
In this section we employ the mathematical setting introduced in Section 2 in order to define two quantities that measure the performance of an algorithm: the scale up and the speed up.
3.1 Scale Up
Let us consider two decompositions and in . Let us consider and representing their equivalence class in . In order to measure the scalability of parallel algorithms we introduce the following quantity
Definition 17.
(Scale up factor) If and have the same granularity set (see Definition 12), the ratio
[TABLE]
is said to be the scale up factor of measured with respect to .
From Definition 14, it follows that
[TABLE]
The next proposition quantifies the scale up when we solve the same problem with an algorithm that is the concatenation of several algorithms which are similar to the first one, with polynomial complexity of degree .
Proposition 18.
Given , and where
* with , , and ,*
- 2.
,
- 3.
, .
Consider and and assume that
**
- 2.
**
where
[TABLE]
then
[TABLE]
where
[TABLE]
Proof.
We have that
[TABLE]
then from (10) it follows that
[TABLE]
that is
[TABLE]
Since , then it is
[TABLE]
then the thesis follows from (11). ∎
Corollary 19.
If is fixed, and it is , and . If is fixed, it is
[TABLE]
and
[TABLE]
If then and ,
3.2 Speed Up
Let be the execution time of one floating point operation.
Remark 10.
In the following when we need to refer to execution time of the computing operators of we will use the following notation of the parameters highlighting the execution matrix characterizing the mapping of the algorithm on the machine .
We assume that
[TABLE]
Definition 20.
(Row execution time) The quantity
[TABLE]
is said execution time of the row of (which is a sub-algorithm of ).
Remark 11.
Let then
[TABLE]
Note that then .
Definition 21.
(Execution time) The quantity
[TABLE]
is said execution time of .
Remark 12.
Let then .
[TABLE]
Remark 13.
Let
[TABLE]
Then, if then .
Remark 14.
Let
1. denote the number of rows of with only one non-empty element (sequential sub-algorithms of );
2. , with , denote the number of rows of with more than one non-empty element.
From the sequence , numbering the rows of , two subsequences of indices originate , and , and the following definition follows
Definition 22.
(Parallel Execution time) The quantity
[TABLE]
is said parallel execution time of .
Definition 23.
(Sequential Execution time) The quantity
[TABLE]
is said sequential execution time of .
Equation (18) can be written as
[TABLE]
This states that, by looking at matrix , the model expresses the size of the parallel and the sequential parts composing the execution time .
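To make the split concrete, the sketch below stores an execution matrix as rows of operator times, with `None` marking an empty element. It assumes, following the footnote to Remark 16, that the execution time of a row is the largest operator time on that row and that the execution time of the algorithm is the sum of the row times. The function names and the sample times are hypothetical.

```python
def row_time(row):
    """Execution time of one row: the slowest operator on it."""
    return max(t for t in row if t is not None)


def execution_times(matrix):
    """Return (sequential part, parallel part, total) of the execution time.

    Every row is assumed to contain at least one non-empty element.
    """
    t_seq = 0.0
    t_par = 0.0
    for row in matrix:
        non_empty = sum(t is not None for t in row)
        if non_empty == 1:
            t_seq += row_time(row)   # row with one operator: sequential sub-algorithm
        else:
            t_par += row_time(row)   # row with several operators: parallel sub-algorithm
    return t_seq, t_par, t_seq + t_par


# Hypothetical execution matrix on 3 computing units (times in seconds).
E = [[2.0, 3.0, 1.0],
     [4.0, None, None],
     [1.0, 1.0, None]]
t_seq, t_par, total = execution_times(E)
```

Here the second row is the only sequential sub-algorithm, so the sequential part is 4.0 and the parallel part is 3.0 + 1.0 = 4.0.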
Let
[TABLE]
is the parameter of the algorithm depending on the most computationally intensive sub-algorithms of .
It holds
[TABLE]
Remark 15.
If , since from (24) it is
[TABLE]
Corollary 24.
From (24) it follows
[TABLE]
[TABLE]
and it assumes its minimum value when .
[TABLE]
Definition 25.
(Speed up in ) Given , two different decompositions and , and
, where ,
- 2.
**
where and differ only on the number of processing elements, if , then the speed up of with respect to is
[TABLE]
Remark 16.
(Ideal Speed up) Since it is always (Footnote 10: is the sum of the maximum operator time on each row, so can be equal to only if all the operators have the same time.)
[TABLE]
then it holds that
[TABLE]
Definition 26.
(Speed up in ) The speed up of with respect to is
[TABLE]
4 Algorithms which are in the same equivalence class
We consider algorithms that are in the same equivalence class, i.e. those corresponding to the same decomposition of the problem.
Theorem 27.
perfectly decomposed according to the decomposition , and perfectly parallel algorithm that solves it on with , if
[TABLE]
it follows that:
[TABLE]
Proof.
If is perfectly parallel, then has no empty elements so
[TABLE]
Therefore, from (25) and (27), it is
[TABLE]
∎
Theorem 28.
For all the matrices of algorithms in , it holds
[TABLE]
and
[TABLE]
Moreover, let us consider and two algorithms belonging to , and their matrices and . We have:
;
- 2.
.
Proof.
From inheritance on of dependencies defined on , it is not possible that , therefore . Then there is at least one row of with non-empty elements. Let be the difference between and . Therefore, since and have the same number of non-empty elements, it is .
Similarly, it can be proved that if then , and if then . ∎
Remark 17.
The minimum execution time is proportional to the dependency degree of , that is when the number of computing units is equal to the concurrency degree of .
We now define a subset of the equivalence class of . Let be the equivalence relation identifying two algorithms with the same . Then
[TABLE]
i.e. consisting of the representatives of the equivalence classes of . (Footnote 11: For example, we can take the algorithm in , , whose execution matrix has the fewest rows.)
Let us now consider matrices associated to algorithms belonging to , varying .
The following result defines the speed up of a parallel algorithm with respect to the sequential algorithm belonging to its class.
Theorem 29.
Consider with
[TABLE]
It holds
[TABLE]
Proof.
From (25), (26) and (32), it follows
[TABLE]
∎
Corollary 30.
Since , from (39) it follows that
[TABLE]
Definition 31.
(Ideal Speed up in ) We let
[TABLE]
be the ideal speed up.
Let denote the number of rows having not empty elements, and , then it is
[TABLE]
Definition 32.
(Total Time of with non empty elements) Let be the time of a row with non empty elements. The quantity
[TABLE]
is the execution time of the part of with non empty elements on each row.
Remark 18.
It holds that
then
The next result shows how the generalized Amdahl’s Law can be derived by using the rows of the execution matrix having at least one non empty element.
Theorem 33.
(Generalized Amdahl’s Law) It is
[TABLE]
where
[TABLE]
Proof.
From (39) it is
[TABLE]
By dividing for it follows that
[TABLE]
that is
[TABLE]
∎
Amdahl’s Law [2] then follows as a particular case of the previous theorem.
Corollary 34.
(Amdahl’s Law) If we assume that only has rows with element or elements, we have
[TABLE]
where
[TABLE]
Proof.
From (42) it follows that
[TABLE]
where
[TABLE]
and
[TABLE]
If the rows with more than one non empty element have elements, it is
[TABLE]
therefore, if we let we get
[TABLE]
∎
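For reference, the classical form of Amdahl's Law [2] can be evaluated directly. In the sketch below `alpha` is the fraction of the execution time that is sequential and `p` is the number of processing elements; both names are ours and are not the paper's notation, and the function illustrates only the classical law, not the generalized one.

```python
def amdahl_speedup(alpha, p):
    """Classical Amdahl speed up for sequential fraction alpha on p units."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (alpha + (1.0 - alpha) / p)


# Speed up for a 10% sequential fraction on a growing number of units.
for p in (1, 2, 4, 8, 1024):
    print(p, round(amdahl_speedup(0.1, p), 3))
```

As `p` grows the speed up approaches `1 / alpha`, so with a 10% sequential fraction it can never exceed 10.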
Let denote the cost of . The cost is defined as the product of the execution time and the number of processors utilized [17]. In this mathematical setting the cost can be written as
[TABLE]
If , from (27) it holds
[TABLE]
The overhead of is the total time spent by all the processing elements over and above that spent in useful computation.
Definition 35.
(Algorithm Overhead) The quantity
[TABLE]
is said overhead of .
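A small numerical sketch of cost and overhead follows. It assumes the textbook relations: cost is the number of units `p` times the parallel execution time, and overhead is the cost minus the time of useful computation, taken here as the sum of all operator times. It also assumes that a row takes as long as its slowest operator. The function name, the `None` encoding of empty elements and the sample times are ours.

```python
def cost_and_overhead(matrix):
    """Return (cost, overhead, efficiency) for an execution matrix of times."""
    p = len(matrix[0])                        # number of computing units
    ops = [t for row in matrix for t in row if t is not None]
    # Parallel time: sum over rows of the slowest operator on each row.
    t_parallel = sum(max(t for t in row if t is not None) for row in matrix)
    t_sequential = sum(ops)                   # useful computation
    cost = p * t_parallel
    overhead = cost - t_sequential
    efficiency = t_sequential / cost          # speed up divided by p
    return cost, overhead, efficiency


# Hypothetical execution matrix on 2 computing units.
E = [[2.0, 2.0],
     [2.0, None],
     [1.0, 3.0]]
cost, overhead, efficiency = cost_and_overhead(E)
```

In this example the overhead of 4 time units has two sources: the empty element on the second row leaves one unit idle for 2 time units, and the unbalanced third row leaves the faster unit idle for another 2.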
Theorem 36.
It holds
[TABLE]
Proof.
It holds
[TABLE]
Moreover,
[TABLE]
therefore it follows from (51)
[TABLE]
and (52) follows. ∎
Definition 37.
(Ideal Overhead in )
From (52) it follows
[TABLE]
Let be the efficiency of where .
Theorem 38.
Let , denote the dimension of the execution matrix of , it holds that
[TABLE]
Proof.
Since , it follows that
[TABLE]
∎
Definition 39.
(Ideal Efficiency in )
Since , it always is . So let
[TABLE]
be the ideal efficiency of .
Remark 19.
It is worth noting the role of parameters and in (46), (54) and (55). If in there are few operators which are much more time consuming than the others, and then and . The more the operators are and the greater the difference is in (54), or the lower the ratio is in (46) and (55). Hence, the greater the overhead is, the lower the speed up and the efficiency are. This is a consequence of a problem decomposition, associated to , which is not well balanced.
Let us now suppose that the algorithm is perfectly parallel, that is its execution matrix has no empty elements. Since it follows from (40) that
[TABLE]
from (52) that
[TABLE]
from (57)
[TABLE]
Remark 20.
If , and , if then the following results hold on:
; 2. 2.
3. 3.
4. 4.
5 Algorithms with operators having the same execution time
We assume that all the operators of the algorithm have the same execution time. For example they are the elementary floating point operations. The execution time is , and without loss of generality we assume that . Hence, it follows that,
[TABLE]
Finally, from (24) it follows that
[TABLE]
Hence, we get
- 2.
if , then
- 3.
- 4.
Finally, if is perfectly decomposed then
[TABLE]
i.e. has the ideal speed up in the classical definition.
Let us now consider matrices associated with algorithms in , varying . The following results hold
and, if , then 2. 2.
3. 3.
Finally, next result relates the overhead to the sparsity degree of the execution matrix.
Theorem 40.
Let us suppose that
[TABLE]
Given , , of order , let be the number of empty elements of the row of ; it is
[TABLE]
Proof.
It holds that
[TABLE]
then from (51)
[TABLE]
∎
Remark 21.
Note that is the sparsity degree of the execution matrix.
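The link between overhead and sparsity can be checked in a few lines under two assumptions that we make explicit: the overhead is the cost minus the sequential execution time, and every operator takes the same time. The symbols below are ours, since the paper's notation is not reproduced here: r and p are the numbers of rows and columns of the execution matrix, t is the common operator time, and e_i is the number of empty elements on row i. Each row then takes time t, so:

```latex
\begin{aligned}
\text{cost} &= p \cdot r\, t, \\
\text{sequential time} &= \Bigl( r\,p - \sum_{i=1}^{r} e_i \Bigr)\, t, \\
\text{overhead} &= p\, r\, t - \Bigl( r\,p - \sum_{i=1}^{r} e_i \Bigr)\, t
                 \;=\; t \sum_{i=1}^{r} e_i .
\end{aligned}
```

Under these assumptions the overhead is the number of empty elements of the execution matrix multiplied by the operator time, so an execution matrix with no empty elements has zero overhead.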
The following table collects the expressions of the quantities that we have derived and that characterize the mathematical framework.
Among the decomposition approaches, recursive decomposition is the most suitable for our performance model, especially for a real-world algorithm. In this case, as described in the example below, a problem is solved by first decomposing it into a set of independent sub-problems. Each of these sub-problems is then solved by applying a similar decomposition into smaller sub-problems followed by a combination of their results, and so on. In this way we get a decomposition matrix whose elements can be further decomposed until we reach the level of detail considered the most suitable for the subsequent analysis.
Example: Let denote the computational problem of the sum of real numbers and The decomposition matrix is
[TABLE]
If can be decomposed as then
[TABLE]
In the same way, if can be decomposed as and
[TABLE]
We have three decompositions for :
[TABLE]
with the following characteristics, according to the corresponding decomposition matrices:
1. : cardinality , concurrency degree and dependence degree ,
2. : cardinality , concurrency degree and dependence degree ,
3. : cardinality , concurrency degree and dependence degree .
meaning that the intrinsic concurrency of a problem heavily depends on the decomposition chosen for that problem. Each decomposition has a level of detail depending on the type of subproblems that are considered.
[TABLE]
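To see how the degrees change with the decomposition, the sketch below compares two ways of decomposing the sum of n numbers: a chain that adds one number at a time, and a recursive pairwise (tree) decomposition. It does not reproduce the three decompositions of the example above, whose details are not available here; the function names and the restriction to n being a power of two are our assumptions.

```python
def chain_sum_rows(n):
    """Row widths of the decomposition matrix when numbers are added one by one."""
    return [1] * (n - 1)


def pairwise_sum_rows(n):
    """Row widths when the sum is decomposed recursively into pairs.

    n must be a power of two, at least 2.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    rows = []
    while n > 1:
        n //= 2
        rows.append(n)      # n independent additions at this level
    return rows


def degrees(rows):
    """Return (cardinality, concurrency degree, dependency degree)."""
    return sum(rows), max(rows), len(rows)


chain = degrees(chain_sum_rows(8))      # 7 additions, one per row
tree = degrees(pairwise_sum_rows(8))    # rows of 4, 2 and 1 additions
```

Both decompositions contain the same number of additions, but the chain has concurrency degree 1 and dependency degree n - 1, while the tree has concurrency degree n / 2 and dependency degree log2(n).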
6 Conclusion
Recent activities of major chip manufacturers show more evidence than ever that future designs of microprocessors and large systems will be heterogeneous in nature, relying on the integration of two major types of components. On the one hand, multi/many-core CPU technology has been developed, and the number of cores will continue to escalate because of the need to pack more and more components on a chip. On the other hand, special purpose hardware and accelerators, especially Graphics Processing Units, are in commodity production. Finally, reconfigurable architectures such as Field Programmable Gate Arrays offer several parameters such as operating frequency, precision, amount of memory, number of computation units, etc. These parameters define a large design space that must be explored to find efficient solutions [26]. To cope with this scenario, performance analysis of parallel algorithms should be re-evaluated to identify the best-practice algorithm on novel architectures [3, 16, 19, 20, 28, 33]. In this paper we presented a mathematical framework which can be used to get a multilevel description of a parallel algorithm, and we showed that it can be suitable for analysing the mapping of the algorithm on a given machine. The model allows the choice of a level of abstraction of the problem decomposition and of the algorithm, which determines the level of granularity of the performance analysis. This feature can be very useful for analysing the mapping of the algorithm on novel architectures. We have assumed abstract models for both the algorithms and the architectures and made numerous simplifying assumptions. However, we believe that a simplified parameterized model gives a useful generalization for better understanding algorithms that can run really fast no matter how complicated the underlying computer architecture is [15].
- [1] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, HPCToolkit: tools for performance analysis of optimized parallel programs, Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 685-701, 2010.
- [2] G.M. Amdahl, Validity of the single-processor approach to achieving large scale computing capabilities, in AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20), AFIPS Press, Reston, Va., pp. 483-485, 1967.
- [3] G. Ballard, J. Demmel, O. Holtz, O. Schwartz, Minimizing Communication in Numerical Linear Algebra, SIAM Journal on Matrix Analysis and Applications, vol. 32, no. 3, pp. 866-901, 2011.
- [4] F. Berman, L. Snyder, Mapping parallel algorithms into parallel architectures, Journal of Parallel and Distributed Computing, vol. 4, no. 5, pp. 439-458, 1987.
- [5] F. Berman, The mapping problem in parallel computation, in Mathematical Aspects of Scientific Software, J.R. Rice (Ed.), IMA Volumes in Mathematics and its Applications, vol. 14, Springer-Verlag, 1988.
- [6] A.J. Bernstein, Analysis of programs for parallel processing, IEEE Transactions on Electronic Computers, EC-15 (5), pp. 757-763, 1966.
- [7] S. H. Bokhari, On the mapping problem, IEEE Transactions on Computers, vol. 30, no. 3, pp. 207-214, 1981.
- [8] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, A Portable Programming Interface for Performance Evaluation on Modern Processors, Int. J. High Perform. Comput. Appl., vol. 14, no. 3, pp. 189-204, 2000.
