Application of the V-Ray Technology for Optimization of the TRFD and FLO52 Perfect Club Benchmarks to CRAY Y-MP and CRAY T3D Supercomputers

Alexander S. Antonov, Vladimir V. Voevodin
Moscow State University, Russia
e-mail: voevodin@vvv.srcc.msu.su
Abstract

The paper shows an application of the so-called V-Ray Technology for optimizing the TRFD and FLO52 Perfect Club Benchmarks to the CRAY Y-MP and CRAY T3D supercomputers. We also briefly discuss the process of determining the potential parallelism of programs within V-Ray, since this part of the technology played the key role in the successful optimization of the codes.
1. Introduction
Although the problem of optimizing programs for parallel computers has existed for more than twenty years, no satisfactory solution has been found so far. Several different approaches have been suggested; however, none of them can yet cover the whole scope of problems arising in the optimization of industrial applications for vector/parallel as well as massively parallel supercomputers.
In this paper we would like to show an application of a principally new, strictly mathematical approach based on the so-called V-Ray technology. This technology incorporates recent advanced results from graph theory and the theory of static analysis of programs [3, 4] and allows one to perform an exhaustive analysis of all required properties of programs, from the commonly adopted data dependency and control flow analysis up to the optimization of data distribution and data locality.

Depending on the goals, this technology makes it possible either to detect and describe the parallel properties of a program or to transform source code in accordance with the peculiarities of a particular parallel computer architecture.
It should be emphasized that the V-Ray technology provides a mathematical guarantee of detecting all required properties within a class of programs (for example, the whole resource of parallelism) and ensures that all suggested transformations are strictly equivalent, even taking round-off errors into account.
As a numerical example we consider the application of the V-Ray technology to optimizing the TRFD and FLO52 Perfect Club Benchmarks for the CRAY Y-MP and CRAY T3D computers. We initially chose the TRFD code because of its poor performance on both vector and parallel architectures. We chose the FLO52 code because it is a very difficult program to parallelize and, to our knowledge, there has not been much success in this area at all.
The current state of implementation of the V-Ray technology in software tools enabled us to perform the most important part of the analysis of the codes by means of the V-Ray system, whereas its forthcoming versions should significantly facilitate the transformation of source codes into parallel programs.
The original version of TRFD delivers about 90 Mflop/s on the CRAY Y-MP C90 computer, independently of the actual number of processors available (recall that the peak performance of the CRAY Y-MP C90 equals 960 Mflop/s per processor). The best hand optimization of the code speeds execution up to 140 Mflop/s on one processor. Clearly, the original TRFD code is not only poorly vectorized; even hand transformations do not lead to any considerable improvement in performance.

Is it possible at all to improve the performance of this benchmark using only equivalent transformations of the code? Running a little ahead, we give the answer right now: the V-Ray technology enables us, first, to improve the performance of this code on one processor of the CRAY Y-MP C90 up to 579 Mflop/s; second, to keep a significant growth of performance for multiprocessor Y-MP configurations; and, at last, to find a way to a really scalable program with a performance of 1.7 Gflop/s on 256 processors of the massively parallel CRAY T3D computer.
This paper is organized as follows. Section 2 gives a flavor of the V-Ray mathematical technology. Section 3 presents a brief description of the TRFD and FLO52 codes and the results of their optimization for vector/parallel (CRAY Y-MP M90/C90) and massively parallel (CRAY T3D) supercomputers.
2. Some Mathematical Issues of the V-Ray Technology
The detection of the parallel structure of programs is the central part of the optimization process. In this section we briefly describe only the steps related to the determination of the potential parallelism of programs, leaving aside other questions such as the search for data distributions or program transformations.
Step 1. First of all, we need to detect all types of dependences in the analyzed program. Many existing approaches use dependences between points of the iteration space to analyze and describe the structure of programs. Usually a subset of points is associated with each point of the iteration space, in accordance with the exactness of the method used for discovering dependences. However, we need more exact information about dependences to determine the real structure of the program. In particular, for each point A of the iteration space we should find the only point B generating a dependence. This point is either the lexicographical maximum or the lexicographical minimum of the subset just mentioned, depending on the type of the analyzed dependence. Hence we have to deal with oriented graphs whose nodes are points of the iteration space and whose arcs connect the two lexicographically nearest points referencing the same data. The graphs are similar to each other but differ in the definition of the start and end points of arcs and in the memory access mode (read/write). As a rule, arcs of these graphs ascend or descend lexicographically within the iteration space.
From a mathematical point of view, all types of dependence graphs are identical, therefore we apply one and the same algorithm for their construction. The final formal description of these graphs may be treated as a finite set of triples $(F_i, \Delta_i, N)$. Here $N$ is a vector of unknown integer input variables of the program, whose last entry is 1. $\Delta_i$ is a linear polyhedron within the iteration space whose size and location depend on the vector $N$; the polyhedron is defined by a system of linear inequalities. The function $F_i$ is linear and has the form $F_i = J_i x + \Phi_i N$, where the matrices $J_i$, $\Phi_i$ are constant. The domain of the function $F_i$ is the polyhedron $\Delta_i$. The polyhedra $\Delta_i$ may intersect and may not cover the entire iteration space. The values of the functions $F_i$ are points of the iteration space: the points $J_i x + \Phi_i N$ and $x$ are the start and end points of an arc, respectively.
We have developed an efficient and fast algorithm to construct the complete set of triples from a source code, i.e., to compute the coefficients of the linear inequalities describing the polyhedra $\Delta_i$ and the coefficients of the matrices $J_i$, $\Phi_i$. Unlike many other algorithms, it does not apply a time-consuming integer programming technique. The key feature is the use of an important characteristic property of lexicographical order that we call the L-property [3].
Dependence graphs of some programs might be more complicated. In the general case the domain of the functions $F_i$ is an intersection of the polyhedra $\Delta_i$ with some regular lattices. The parameters of the lattices can also be determined from the source code in accordance with the general theory (the current version of the V-Ray System does not support lattice analysis because in practice we have met lattices only twice and found a way to process them by much simpler methods). It should be noted that if for some reason a dependence graph cannot be expressed as a set of triples, then we use triples for its minimal extension, keeping the further analysis unchanged.
The result of the first step is an explicit symbolic representation of all dependences. Exactly this form of representation enables us to solve many problems necessary for the efficient exploration of the parallel structure of algorithms and programs; a small sketch of such a representation is given below.
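To make the triple formalism concrete, here is a minimal illustrative sketch (ours, not part of the V-Ray system; all names are hypothetical) of how a dependence graph could be stored as triples $(F_i, \Delta_i, N)$ and how the arcs it encodes could be enumerated for fixed input variables $N$:

    import numpy as np
    from itertools import product

    class Triple:
        """One triple (F, Delta, N): F(x) = J x + Phi N gives the start
        point of the arc ending at x; Delta = {x : A x + B N <= 0} is
        its domain, a polyhedron given by linear inequalities."""
        def __init__(self, J, Phi, A, B):
            self.J, self.Phi, self.A, self.B = map(np.asarray, (J, Phi, A, B))

        def contains(self, x, N):
            return np.all(self.A @ x + self.B @ N <= 0)

        def arcs(self, bounds, N):
            """Enumerate arcs F(x) -> x over a box of iteration points."""
            for x in product(*(range(lo, hi + 1) for lo, hi in bounds)):
                x = np.array(x)
                if self.contains(x, N):
                    yield self.J @ x + self.Phi @ N, x

    # Example: DO i = 2, n: a(i) = a(i-1) + 1.  The flow-dependence arc
    # ending at x = (i) starts at F(x) = (i - 1); its domain is 2 <= i <= n.
    # N = (n, 1): the last entry is the constant 1, as in the paper.
    t = Triple(J=[[1]], Phi=[[0, -1]],              # F(i) = i - 1
               A=[[-1], [1]], B=[[0, 2], [-1, 0]])  # 2 - i <= 0, i - n <= 0
    N = np.array([6, 1])
    for src, dst in t.arcs(bounds=[(1, 6)], N=N):
        print(src, "->", dst)                       # [1] -> [2], ..., [5] -> [6]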
Step 2. The next step is the detection of ParDo loops. Taking into account the explicit formulas for data dependences, this problem can be solved easily. To examine the ParDo property one may use a few criteria (sufficient, as well as sufficient and necessary) testing some joint properties of the matrices $J_i$ and $\Phi_i$. The actual number of loops and the nesting structure do not seriously affect the complexity of the criteria. For example:

Statement: for a loop to be a ParDo loop it is necessary and sufficient that for any triple of the algorithm graph of this loop the inclusion $\Delta_i \subset G_i$ holds, where

--- $\Delta_i$ is the polyhedron from the triple,
--- $G_i = \{\, f_1 \le i_1 \le f_1 \,\}$ (that is, $f_1 = i_1$),
--- $i_1$ is the parameter of the loop under analysis,
--- $f_1$ is the first component of the vector function $F_i$.
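As an illustration only (a sketch of ours, not the V-Ray implementation), a conservative, sufficient-only variant of this criterion can be checked directly on the matrices $J_i$ and $\Phi_i$: if $f_1(x) = i_1$ holds identically, it certainly holds on $\Delta_i$:

    import numpy as np

    def pardo_sufficient(J, Phi, k=0):
        """Sufficient check that loop k is ParDo for one triple: the k-th
        component of F(x) = J x + Phi N equals x_k identically, so every
        dependence arc stays inside a single iteration of loop k."""
        J, Phi = np.asarray(J, float), np.asarray(Phi, float)
        e = np.zeros(J.shape[1]); e[k] = 1.0
        return np.allclose(J[k], e) and np.allclose(Phi[k], 0.0)

    # DO i / DO j: a(i,j) = a(i,j-1) + 1.  F(i,j) = (i, j-1): the i-loop
    # carries no dependence (ParDo), while the j-loop does.
    J = [[1, 0], [0, 1]]
    Phi = [[0, 0], [0, -1]]               # N = (n, 1)
    print(pardo_sufficient(J, Phi, k=0))  # True:  outer i-loop is ParDo
    print(pardo_sufficient(J, Phi, k=1))  # False: j-loop carries the arc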
Innermost ParDo loops are better suited for vector processing, while outermost ones suit parallel execution on multiprocessor computers. Hence, for the optimal mapping of programs onto a particular computer, we sometimes would like to use loop interchange. To test whether this technique can be applied, we make use of a statement requiring quite trivial analysis of the triples (a sketch of such a test follows the statement):

Statement: in order to interchange two neighbouring loops in a perfect nest of loops it is necessary and sufficient that for any triple of the algorithm graph of this nest the inclusion $\Delta_i \subset G_i$ holds, where

--- $\Delta_i$ is the polyhedron from the triple,
--- $G_i = \{\, f_2 \le i_2 \,\}$,
--- $i_2$ is the parameter of the inner loop,
--- $f_2$ is the second component of the function $F_i$.
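Again as an illustrative sketch under the same assumptions (ours, not the actual V-Ray code), a conservative form of this test: if $f_2(x) = i_2 + c$ with a constant $c \le 0$, then $f_2 \le i_2$ holds everywhere, in particular on $\Delta_i$:

    import numpy as np

    def interchange_sufficient(J, Phi, k=1):
        """Sufficient check for interchanging loops k-1 and k in one triple:
        f_k(x) = x_k + c with a constant c <= 0, hence f_k <= i_k on Delta."""
        J, Phi = np.asarray(J, float), np.asarray(Phi, float)
        e = np.zeros(J.shape[1]); e[k] = 1.0
        # Phi[k] may touch only the last (constant-1) entry of N, and then
        # only with a nonpositive coefficient.
        return (np.allclose(J[k], e)
                and np.allclose(Phi[k][:-1], 0.0)
                and Phi[k][-1] <= 0.0)

    # Same nest as above: F(i,j) = (i, j-1), so f_2 = j - 1 <= j and the
    # two loops may be interchanged.
    print(interchange_sufficient([[1, 0], [0, 1]], [[0, 0], [0, -1]]))  # True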
The second step mostly deals with general properties of programs which, on the one hand, do not depend on a computer architecture (the output of this stage might be a marked original code) but, on the other hand, can be easily detected after a simple analysis of the explicit formulas representing the data dependences. An interesting observation: for a large part of real-world applications this information suffices for efficient target optimization. Notice that an advanced analysis of dependences should be continued at the third stage if we want to adjust the program structure well to the peculiarities of a particular computer and compiler.
Step 3. ParDo loops describe coordinate parallelism within the iteration space. However, the resource of coordinate parallelism may be insufficient, therefore the main purpose of this step is discovering ``skew'' parallelism based on the notion of a schedule.

Let an oriented graph be given in the iteration space. A strict (generalized) schedule of the graph is a real function defined on the iteration space and strictly increasing (nondecreasing) along the arcs of the graph. Schedules have numerous very important properties. Suppose, for example, that the graph is an algorithm graph. Let us take any schedule and examine its level surfaces. These surfaces split the iteration space into nonintersecting sets, where each set corresponds to one level. The schedule gives an equal value for all points of one set. Since the schedule does not decrease along the arcs of the algorithm graph, one may execute operations set-by-set according to the growth of the schedule values. If the schedule is strict, then the points of any set cannot be linked by arcs, and hence these operations can be run simultaneously. In other words, the search for computational parallelism in a program is reduced to the investigation of schedules for its algorithm graph.
The third step is entirely devoted to determining and analyzing schedules, represented, similarly to the graph description, as a finite set of triples $(f_i, \Gamma_i, N)$. Here $N$ is a vector of integer input variables of the program, and $\Gamma_i$ is a linear polyhedron within the iteration space whose size and location depend on the vector $N$. The linear functional $f_i$ sets a schedule and has the form $f_i = \langle \eta_i, x \rangle + \langle \varphi_i, N \rangle$, where $\langle \cdot\,, \cdot \rangle$ denotes a dot product and the vectors $\eta_i$, $\varphi_i$ are constant. The domain of the functional $f_i$ is the polyhedron $\Gamma_i$. Unlike the triples of dependence graphs, the polyhedra $\Gamma_i$ are nonintersecting and cover the entire iteration space.
There is a certain freedom in determining the polyhedra $\Gamma_i$. In particular, we may consider them identical to the domains of the loop parameters, or take into account that the arcs of dependences are actually piecewise, and so on. Assume that we fix the polyhedra $\Gamma_i$ somehow. The condition that schedules be nondecreasing along arcs is equivalent to the inequality $At \le 0$, where $t$ is the direct sum of all the vectors $\eta_i$ and $\varphi_i$. The matrix $A$ is computed unambiguously from the triples of the graph, the polyhedra $\Gamma_i$, and the domains of the loop parameters. Hence the schedule search problem is reduced to searching for solutions of the system of linear inequalities $At \le 0$. We call this system the system of informational closure.
At the end of the third step we should find a quite representative set of solutions of the system $At \le 0$, in view of one more property of schedules. Without going into details, it can be stated as follows: if we have found $s$ independent schedules, then by means of these schedules we can transform the program to a new form where each nest of loops with depth no less than $s$ will have at least $s - 1$ ParDo loops. A small wavefront sketch below illustrates the idea of skew parallelism.
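For intuition only (our sketch, not V-Ray output): for a nest with the uniform dependence distances (1, 0) and (0, 1), the linear schedule $f(i, j) = i + j$ is strict, and its level sets (anti-diagonals) consist of mutually independent points that may run in parallel:

    import numpy as np

    # Dependence distances d: each arc goes from x - d to x.
    distances = [np.array([1, 0]), np.array([0, 1])]
    eta = np.array([1, 1])        # candidate schedule f(x) = <eta, x>

    # f is a strict schedule iff it strictly increases along every arc:
    # <eta, x> - <eta, x - d> = <eta, d> >= 1 for all distances d.
    assert all(eta @ d >= 1 for d in distances)

    # Level sets of the schedule over a 4 x 4 iteration space: all points
    # with equal i + j are independent and form one parallel "wavefront".
    n = 4
    for level in range(2, 2 * n + 1):
        front = [(i, j) for i in range(1, n + 1)
                        for j in range(1, n + 1) if i + j == level]
        print(f"f = {level}: {front}")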
3. Sample V-Ray Problems

As our sample problems we have chosen two application codes from the Perfect Benchmarks. The Perfect Club suite has been used for many years for evaluating the real performance of parallel computers on industrial applications. By now, a lot of papers have been published (for example, [1, 2]) describing the structure of each code. However, just this high popularity of the suite seemed attractive to us: it is much more interesting to produce a nontrivial result in a field where many researchers have worked very hard.
3.1. Optimization of the TRFD code for Y-MP

Analysis of TRFD involved the following steps:

1. First, it was necessary to reconstruct TRFD's multidimensional arrays from their compact one-dimensional form (an illustrative sketch of such a reconstruction is given after this list). After the completion of this step it became possible to describe TRFD in terms of the linear model needed by the V-Ray Technology.
2. The most important part of the optimization process is the determination of the exact informational structure of the program. Conversion of the program to this analyzable form enables us to determine all the necessary parallel properties, for example, the existence or absence of dependencies between statements and between separate parts of the program, the locality of data references, the possibility of array privatization, and so on.

Figure 1. Marked loop profile of TRFD (loops with independent iterations are marked by dots).

M90 CPUs | Peak perform. (Mflop/s) | Baseline (Mflop/s) | Manual opt. (Mflop/s) | V-Ray opt. (Mflop/s)
1        | 333                     | 56.19              | 81.72                 | 247
4        | 1332                    | 54.86              | 261.64                | 822
8        | 2664                    | 54.34              | 481.03                | 954

C90 CPUs | Peak perform. (Mflop/s) | Baseline (Mflop/s) | Manual opt. (Mflop/s) | V-Ray opt. (Mflop/s)
1        | 960                     | 89.5               | 139.71                | 579.7
8        | 7680                    | 89.6               | 962.68                | 2440.5

Table 1. Results for CRAY Y-MP supercomputers.
3. Once we know the exact ``informational structure'' of the code, we can easily extract the whole resource of parallelism. In particular, the total resource of the so-called coordinate parallelism of the subroutine OLDA, which is similar to the ParDo property of loops, can be illustrated by the marked loop profile of Fig. 1.

Note that at least one important conclusion immediately follows from this picture: if one relies only on the structure of the innermost and outermost loops of the routine, then a considerable part of its resource of parallelism will remain hidden and unusable.
4. From the previous step we discovered that TRFD possesses a large reserve of coordinate parallelism, therefore for this program we could use quite simple loop transformations to increase vector length, to perform vectorization along the leading dimension of arrays, to remove dependencies from loops, and to remove superfluous synchronization barriers.
5. The analysis of data locality is very important for the efficient execution of programs on any kind of parallel computer. In practical use, the objective of this process is to reduce the data transfers between main memory and registers. In terms of the V-Ray technology, it requires an analysis of data locality and of the expediency of the temporary use of arrays. For the TRFD benchmark this optimization yielded a significant improvement in performance.
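The array reconstruction mentioned in step 1 can be pictured with a small sketch (ours; the triangular packing shown is a common Fortran idiom and is only assumed here to resemble TRFD's actual scheme): a symmetric pair (i, j) stored in a packed one-dimensional array is mapped back to explicit two-dimensional indexing, which makes the subscripts linear and analyzable:

    import numpy as np

    def packed_index(i, j):
        """Common triangular packing: element (i, j), i >= j, 1-based,
        stored at position i*(i-1)/2 + j of a 1-D array."""
        if i < j:
            i, j = j, i
        return i * (i - 1) // 2 + j      # 1-based position

    def unpack(x, n):
        """Reconstruct the explicit symmetric n x n array from packed form."""
        a = np.empty((n, n))
        for i in range(1, n + 1):
            for j in range(1, i + 1):
                a[i - 1, j - 1] = a[j - 1, i - 1] = x[packed_index(i, j) - 1]
        return a

    n = 4
    x = np.arange(1.0, n * (n + 1) // 2 + 1)   # packed data: 10 elements
    print(unpack(x, n))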
The final results of the optimization for the CRAY Y-MP M90/C90 computers are shown in Table 1, which adopts the following notation:

--- 'Peak perform.': the theoretical peak performance of the computers;
--- 'Baseline': the performance of the original (untransformed) code;
--- 'Manual opt.': the previously best results of manual optimization of the code;
--- 'V-Ray opt.': the performance after optimization of the TRFD code using our approach.
We would especially like to note that all TRFD optimizations were performed using equivalent Fortran transformations, with no assembler fragments and without calls to highly optimized library routines.

3.2. Optimization of the TRFD code for T3D

The method for optimizing the TRFD code for the T3D computer differs from the methods used for vector/parallel computers. Our objectives for the T3D were:

1. To identify a number of parallel computation branches that would be comparable to the number of processors available;

2. To find a data distribution strategy for the arrays of the code that would be consistent with the code's parallelism. This step is crucial for minimizing communication; and

3. To enhance the ``locality of computations'' and the ``locality of data'' inside each processor for more effective use of the Alpha chip's cache memory.
From the work we did to vectorize TRFD, we already knew that the total potential resource of parallelism of the code is extensive. However, while investigating the set of possible data distributions among processors, we proved that TRFD possesses the following crucial properties:

1. Within each individual nest of loops, a data distribution strategy exists that supports two-dimensional parallelism. Two-dimensional parallelism implies the independence of all iterations of two loops inside a nest. By using a consistent data distribution strategy within the loop nest, we can avoid the need for data transfer between iterations.

2. A data distribution strategy that could be used across all three nests and that would also be consistent with the use of two-dimensional parallelism within each nest does not exist. Moreover, there does not even exist a uniform distribution for all nests consistent with the use of one-dimensional parallelism within each nest.
Following our objective of a highly scalable code, we decided to use a different data distribution strategy within each of the nests and to perform a redistribution of data between them (a toy sketch of such a redistribution is given below). The parallel version for the CRAY T3D uses Elegant Mathematics' communication library. The results of the optimization are presented in Table 2.
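To illustrate what such a redistribution involves (a toy sketch of ours, not the Elegant Mathematics library and not TRFD's actual distributions), consider switching an array between a row-block and a column-block distribution; each processor can compute exactly which elements it must send to every other processor:

    import numpy as np

    def owner_row_block(i, j, n, p):
        """Row-block distribution: PE k holds rows k*n/p .. (k+1)*n/p - 1."""
        return i * p // n

    def owner_col_block(i, j, n, p):
        """Column-block distribution: PE k holds the k-th block of columns."""
        return j * p // n

    def transfer_plan(n, p):
        """For each (src, dst) pair, the elements src must send to dst when
        the array is redistributed from row blocks to column blocks."""
        plan = {}
        for i in range(n):
            for j in range(n):
                src = owner_row_block(i, j, n, p)
                dst = owner_col_block(i, j, n, p)
                if src != dst:
                    plan.setdefault((src, dst), []).append((i, j))
        return plan

    # An 8 x 8 array over 4 PEs: every PE sends 3/4 of its block, which is
    # why a single consistent distribution (when it exists) is preferable.
    for (src, dst), elems in sorted(transfer_plan(8, 4).items()):
        print(f"PE {src} -> PE {dst}: {len(elems)} elements")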
3.3. Optimization of the FLO52 code for T3D

Our FLO52 example is a much more complicated code than TRFD. The original code consists of approximately 25 subroutines; 11 of them consume a considerable part of the total time and should be parallelized.

CRAY T3D PEs | V-Ray opt. (seconds) | V-Ray opt. (Mflop/s)
4            | 6.013                | 71.66
8            | 3.215                | 134.01
16           | 1.702                | 253.43
32           | 0.896                | 480.85
64           | 0.494                | 871.44
128          | 0.301                | 1435.96
256          | 0.253                | 1700.93

Table 2. Results for the CRAY T3D supercomputer.
The final version of FLO52 contains more than 60 communication points, and the total number of data transfers during run-time is more than 35,000, of which 4,440 are all-to-all.
We show data for three cases (Fig. 2). The small case is the small variant of the Perfect Club suite with a rectangular X by Y domain. For this case we can see that performance degradation begins immediately at the 32-PE configuration. Predictably, the 4X larger version provided in the Perfect Club Suite moves the point at which performance degradation begins. We further extended FLO52 ourselves to include a 16X larger version of the program and saw that the degradation start point moves as expected. Nevertheless, the 16X larger version is still small enough to be run on a single PE.

The most important property of the parallel FLO52 code becomes evident from this figure: the larger the problem size, the higher the efficiency of the code. The reason for the poor speedup of the small case is its too small problem size, as a result of which data transfers completely dominate computations.
The key point for FLO52 is that, by using V-Ray, we are certain that all resources of parallelism have been exploited and that any additional parallelization effort using the FLO52 algorithmic approach to the solution of the application problem would be pointless. This is extremely valuable information, for example, for a code manager responsible for the implementation of existing application programs on parallel computers.

Speedup
e
e
e
e
e
e
e
e
e
pp pp pp pp pppppp ppppppppppppppppp pp pp pppppp pp pp pppp pppp pp pp pp pp pppp pp pp pp pp pp pp pp pp pp pp pp pp pp pp
pp pp
pp pp pp ppp pp pppp pp pp pppp ppppppppppp pp pp pp pp pp pp pp pp pp pp pp
e
e
e
e
e
e
e
e
e
pp pp ppp pp pp pp pp pp pp pppppppp pppppp pp pp pp pp pp pp pp pp pppp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp
pp pp
pp
pp
pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp ppp pppp pp ppppppppppp pppppppppp ppppppppp ppppppppppp pppppppppp pp pppppppp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp pp
e
e
e
e
e e e
e
e
1
5
10
15
20
1 2 4 8 16 32 64 128 256
4X\Theta 4Y
Large=2X\Theta2Y
Small=X\ThetaY
Number of PE's
Figure 2. Speedup of FLO52.
4. Conclusions

The process of optimizing programs for parallel computers involves many difficult problems related to the structure of the program to be analyzed as well as to the properties of the target computer architecture. To resolve them successfully we use the V-Ray technology, which not only gives keys to their solutions but also makes it possible to understand whether any further improvement of the performance is feasible. We have shown the potential of the V-Ray technology using only the TRFD and FLO52 Perfect Benchmarks, and we hope soon to demonstrate its capabilities in optimizing other benchmarks and applications for a wide range of parallel computers.
5. Acknowledgments

The authors would like to acknowledge CRAY Research, Inc. (USA) for providing computer resources, and its specialists J. Brooks, Q. Sheikh and R. Numrich for their assistance in performing the numerical experiments.
References

[1] G. Cybenko, L. Kipp, L. Pointer, and D. Kuck. Supercomputer performance evaluation and the Perfect benchmarks. Technical Report 965, CSRD, University of Illinois at Urbana, 1990.

[2] C. M. Grassl. Parallel performance of applications on supercomputers. Parallel Computing, 17:1257-1273, 1991.

[3] V. V. Voevodin. Mathematical Foundations of Parallel Computing, volume 33 of Computer Science Series. World Sci. Publ. Co., Singapore, 1992.

[4] Vl. V. Voevodin. Theory and practice of parallelism detection in sequential programs. Programming and Computer Software, 18(3), 1992.